engineer · ai researcher · systems

Cheng Chang

I train language models from scratch on free-tier CPUs, write chess engines in C++23, and build post-transformer architectures on Ascend NPUs — no GPUs, no pretraining, everything from first principles.

telemetry
67 repos · 240 stars · 12 languages

/ about

I build things from scratch to understand them — then document what breaks.

I'm a self-taught engineer who treats research like engineering and engineering like research. My work spans training language models on free-tier CPUs, writing a from-scratch chess engine, designing post-transformer architectures for Ascend NPUs, and reverse-engineering Windows games — all to find the edges of what a single person on commodity hardware can actually do.

What ties it together is honesty: every project publishes its failures alongside its results. A broken humanizer gets a seven-cause forensic writeup. A chess engine reports exactly which techniques it borrowed. A trainer retracts its own anti-cheat claims when the reverse-engineering proves them wrong.

focus
research: language models · post-transformer architectures · adversarial ML
engines: chess (C++23) · Go self-play networks · high-performance search
systems: CPU-native training · Ascend NPU · x64 reverse-engineering
$ currently — FlashLM CPUFlow v9.7 · Go OGS prep

/ work

Selected work

Six deep case studies, then the rest. Every number is pulled from the project's own source, logs, or paper.

01

FlashLM

CPU-native language models trained from scratch on free-tier hardware.

30+ experiments training language models entirely on 4-vCPU free-tier CPUs — no GPU, no pretraining — on TinyStories. The project surfaces an uncomfortable finding: the training objective matters far more than architecture, and lower perplexity does not produce coherent text.

Active research · 2025–26
Val PPL (best coherent): 10.23 (CPUFlow v9.7 · 2.47M params)
FSP aux-loss gain: 2.5× (65K-param head · +1.7% params)
Gated DeltaNet PPL: 2.33 (3.54× over transformer baseline)
Throughput: 3,369 tok/s (4 vCPU)
Experiments: 30+
GitHub: 31★
Val PPL on TinyStories (lower is better; best PPL ≠ best coherence):
v5 Thunderbolt · 29.7M params · PPL 1.36
v7.4 GatedΔNet · 6.6M params · PPL 2.33
v8 discrete · 2.2M params · PPL 9.30
v9.7 memory · 2.5M params · PPL 10.23
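
For readers comparing the PPL figures, here is a minimal sketch of how perplexity is conventionally derived from per-token log-probabilities. It assumes natural-log cross-entropy, which is the usual PyTorch convention but is not confirmed by the project page. Under that assumption a PPL of 10.23 corresponds to a mean loss of roughly 2.325 nats per token. The function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every token probability 0.5 is, on average,
# choosing between two equally likely options at each step.
coin_flip_model = [math.log(0.5)] * 8
print(perplexity(coin_flip_model))
```

Perplexity only measures average next-token surprise, which is one reason a lower value need not translate into coherent long-range text.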

key innovations

Ternary / BitLinear weights

Weights constrained to {-1, 0, +1}; proved 1.58-bit quantization converges at small scale.
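
A rough sketch of ternary quantization follows, assuming the absmean scheme commonly used for BitLinear-style layers (scale by the mean absolute weight, round, clip). Whether FlashLM uses exactly this scheme is an assumption, and the function name is illustrative. The "1.58-bit" label comes from log2(3) ≈ 1.58, the information content of a three-valued weight.

```python
def ternary_quantize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one shared scale (absmean scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    scale = max(scale, eps)  # guard against an all-zero weight vector
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction used when comparing against the float weights."""
    return [q * scale for q in quantized]


q, s = ternary_quantize([0.9, -0.05, -1.2, 0.3])
print(q, s)  # q == [1, 0, -1, 0]; s is the mean absolute weight
```

In training, schemes like this usually keep full-precision shadow weights and pass gradients straight through the rounding step; that part is omitted here.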

Gated DeltaNet (CORTEX-VIII)

Delta-rule memory M += β(v − M·k)⊗k performs targeted correction rather than blind accumulation — 3.54× better PPL than the transformer baseline.
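
The difference between targeted correction and blind accumulation can be seen in a few lines. The sketch below implements only the update written above, M += β(v − M·k)⊗k, in plain Python. The gating term of the full Gated DeltaNet is omitted, and the helper names are illustrative.

```python
def matvec(M, k):
    """Read from memory: M @ k."""
    return [sum(m * kj for m, kj in zip(row, k)) for row in M]


def delta_write(M, k, v, beta):
    """Delta rule: move the value stored under key k toward v by a factor beta."""
    error = [vi - pi for vi, pi in zip(v, matvec(M, k))]
    return [[m + beta * e * kj for m, kj in zip(row, k)]
            for row, e in zip(M, error)]


def additive_write(M, k, v):
    """Plain linear-attention write: M += v (outer) k, with no correction."""
    return [[m + vi * kj for m, kj in zip(row, k)] for row, vi in zip(M, v)]


key = [1.0, 0.0]
empty = [[0.0, 0.0], [0.0, 0.0]]

# Write [2, 3] under the key, then overwrite it with [5, 7].
M_delta = delta_write(delta_write(empty, key, [2.0, 3.0], 1.0), key, [5.0, 7.0], 1.0)
M_add = additive_write(additive_write(empty, key, [2.0, 3.0]), key, [5.0, 7.0])

print(matvec(M_delta, key))  # [5.0, 7.0]: the old value was replaced
print(matvec(M_add, key))    # [7.0, 10.0]: old and new values were summed
```

With a unit-norm key and β = 1 the delta rule overwrites exactly; smaller β values interpolate between the old and new content.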

Cumsum / CPUFlow backbone

An O(n) linear-attention cumsum replacing O(n²) attention — 15× cheaper per layer (136µs vs 2,062µs).
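
To illustrate why a cumulative sum can stand in for quadratic attention, the sketch below computes unnormalized causal linear attention two ways: by summing over all earlier positions, and by carrying a running state. It assumes no softmax and no feature map, which may differ from the actual CPUFlow backbone; all names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention_quadratic(Q, K, V):
    """O(n^2): every query revisits every earlier key/value pair."""
    out = []
    for t, q in enumerate(Q):
        acc = [0.0] * len(V[0])
        for s in range(t + 1):
            weight = dot(q, K[s])
            acc = [a + weight * v for a, v in zip(acc, V[s])]
        out.append(acc)
    return out


def causal_attention_cumsum(Q, K, V):
    """O(n): keep a running sum S of v (outer) k and read it with the query."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += v[i] * k[j]
        out.append([dot(row, q) for row in S])
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
V = [[1.0], [2.0], [3.0]]
print(causal_attention_quadratic(Q, K, V))  # [[1.0], [10.0], [50.0]]
print(causal_attention_cumsum(Q, K, V))     # same values, one pass
```

The running state has a fixed size (d_v × d_k), so the per-token cost does not grow with sequence length.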

RAM-Net sparse memory

Product-Softmax addressing expands to 512 virtual slots with Top-8 sparse read/write.
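
One way such addressing could work is sketched below. The factorization is an assumption, not taken from the project: two small softmaxes (16 and 32 logits here) are multiplied into a joint distribution over 16 × 32 = 512 virtual slots, and only the 8 most probable slots are kept and renormalized. The function names and factor sizes are hypothetical.

```python
import heapq
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def product_softmax_topk(logits_a, logits_b, k=8):
    """Return [(slot_index, weight)] for the k most probable virtual slots."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    joint = [(pa[i] * pb[j], i * len(pb) + j)
             for i in range(len(pa)) for j in range(len(pb))]
    top = heapq.nlargest(k, joint)
    norm = sum(p for p, _ in top)
    return [(slot, p / norm) for p, slot in top]


# Hypothetical controller outputs: factor A prefers index 3, factor B index 7.
logits_a = [0.0] * 16
logits_b = [0.0] * 32
logits_a[3] = 5.0
logits_b[7] = 5.0

addresses = product_softmax_topk(logits_a, logits_b)
print(addresses[0][0])  # 103, i.e. 3 * 32 + 7
```

The appeal of a factored scheme is that the controller emits 16 + 32 logits rather than 512, while reads and writes touch only 8 slots.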

Future Sentence Prediction

Predict a bag-of-words 16 tokens ahead via a 65K-param head — the project’s largest single gain (2.5×).
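
A minimal sketch of how such an auxiliary target might be built, under one reading of the description: for each position, the target is a multi-hot vector marking which vocabulary items appear in the next 16 tokens, scored with binary cross-entropy. The window definition, the loss, and the function names are assumptions for illustration.

```python
import math


def future_bag_targets(token_ids, vocab_size, horizon=16):
    """Multi-hot target per position: which tokens occur in the next `horizon` steps."""
    targets = []
    for t in range(len(token_ids)):
        row = [0.0] * vocab_size
        for tok in token_ids[t + 1 : t + 1 + horizon]:
            row[tok] = 1.0
        targets.append(row)
    return targets


def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy, averaged over the vocabulary."""
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


tokens = [1, 2, 2, 3]
targets = future_bag_targets(tokens, vocab_size=5, horizon=2)
print(targets[1])  # [0.0, 0.0, 1.0, 1.0, 0.0]: tokens 2 and 3 lie ahead
```

Because word order inside the window is discarded, the head only has to predict what comes next in aggregate, which is a much coarser signal than next-token prediction.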

honest

No FlashLM model achieves true narrative coherence yet: every sub-16.8M-param model breaks down ~100 tokens in. v5 "Thunderbolt" (PPL 1.36) is the only readable model, but it required 40h on hardware beyond the free tier. Every failure is documented honestly.

repo · website · paper · hf · changcheng967 (cpuflow-v97-memory, flashlm-v5-thunderbolt, …)
Python · PyTorch (CPU/MKL) · HuggingFace · TinyStories
02🧬

State-Flow Machine

A post-transformer architecture that tracks program state explicitly.

Replaces a single transformer with four subsystems (perception / execution / structure / meta). An explicit State Slot Bank plus a Gated DeltaNet cell with negative eigenvalues gives reversible state tracking — and on synthetic code-execution benchmarks it generalizes to far longer programs than transformers can.

Research · LLM-scale in progress · 2026
Accuracy @ 4× length: 62% (vs transformer 1.9%)
Accuracy @ 8× length: 35% (vs transformer 1.0%)
Scaling transformer 5×: no help (2.2M ≈ 443K on extrapolation)
Params (eval): 961K
[Chart: exact-match % vs program length; series: State Slots, Transformer]

key innovations

State Slot Bank

Explicit memory registers that bind to variable identities and store current values. Execution order is preserved via sequential per-chunk writes — something parallel attention fundamentally cannot do.

Gated DeltaNet cell · negative eigenvalues

Learnable eigenvalues constrained to [−1, 1] let the cell subtract state, enabling proper variable reassignment that standard linear RNNs (eigenvalues in [0, 1]) cannot do.
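
A toy scalar example shows why the sign of the eigenvalue matters. This is not the SFM cell, only a one-dimensional recurrence h ← a·h where the input selects the multiplier. With a = −1 available, the state can flip and therefore track parity; with multipliers limited to [0, 1] it can only shrink or hold. The function name is illustrative.

```python
def run_scalar_cell(bits, multiplier_for_one):
    """h <- a * h, where a = multiplier_for_one on a 1-bit and a = 1 on a 0-bit."""
    h = 1.0
    for bit in bits:
        a = multiplier_for_one if bit == 1 else 1.0
        h = a * h
    return h


stream = [1, 0, 1, 1]  # three 1-bits, so the parity is odd

# A negative eigenvalue lets the state flip sign: the sign encodes parity.
print(run_scalar_cell(stream, -1.0))  # -1.0

# A multiplier in [0, 1] can only decay the state; its sign never changes.
print(run_scalar_cell(stream, 0.5))   # 0.125
```

Reassigning a variable is a similar operation: the old contribution has to be removed from the state, not merely decayed, before the new one is written.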

Four-system assembly

Perception (linear attention) + Execution (state slots) + Structure (dynamic GNN) + Meta (recurrent controller with a verification head), combined via a learned cross-system bridge.

DaVinci-Cube aligned

Every dimension a multiple of 16 for the Ascend NPU; log-space parallel scan for FP16 numerical stability.
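
Both ideas are small enough to sketch. The first helper rounds a dimension up to the next multiple of 16. The second computes a running product of decay gates by summing logarithms, which keeps intermediate values in a range FP16 can represent where a direct product of many small gates would underflow. The loop below is sequential for clarity; because addition is associative, the same sums can be computed with a parallel scan. The names are illustrative and not taken from the project.

```python
import math
from itertools import accumulate


def round_up_to_multiple(n, base=16):
    """Smallest multiple of `base` that is >= n."""
    return ((n + base - 1) // base) * base


def cumulative_decay_logspace(gates):
    """Running product of gates in (0, 1], computed as exp(cumsum(log(gate)))."""
    log_prefix_sums = accumulate(math.log(g) for g in gates)
    return [math.exp(s) for s in log_prefix_sums]


print(round_up_to_multiple(100))                   # 112
print(cumulative_decay_logspace([0.5, 0.5, 0.5]))  # approximately [0.5, 0.25, 0.125]
```

In a real kernel the exponentiation would typically be deferred as long as possible, so that only differences of log-sums are ever exponentiated.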

honest

The validated result is on a synthetic arithmetic-execution task (5 variables, values 0–100). The LLM-scale experiment — fine-tuning Qwen2.5-Coder-1.5B with inline DeltaNet SFM blocks in MindSpore — is in progress; no results yet.

Python · PyTorch · MindSpore · Ascend 910 NPU
03🛡️

Project Aegis

Forensic failure analysis of a broken AI-text humanizer — then a principled fix.

Aegis v3 (a Qwen3-1.7B AI-text humanizer) failed catastrophically: ~30% GPTZero bypass and universal mode-collapse. Rather than guess, this project performs a 360° root-cause diagnosis across behavior, data, reward, and inference, isolating seven independently confirmed causes, then specifies v4 as the minimal set of literature-proven fixes.

Research · v4 specified, GRPO not yet trained · 2026
v3 GPTZero bypass: ~30% (the credible figure)
v3 mode-collapse: 5–11× (length inflation on every input)
v3 DPO over-opt: loss 0.047 / acc 99.4% (textbook reward hacking)
Root causes diagnosed: 7

key innovations

Seven-cause forensic diagnosis

Mode-collapse, "dumbcrafting" data (injected misspellings verified in the preference pairs), single-detector reward, no meaning constraint, forced register/person, self-sabotaging post-processing, and a backwards premise.

Constrained GRPO + SimPO

DEPO-style GRPO with a hard semantic-preservation constraint (BERTScore ≥ 0.85). Padding violates the constraint, so collapse is structurally prevented — not heuristic.
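
The structural argument can be sketched as follows. A rewrite whose semantic similarity falls below 0.85 receives a fixed penalty regardless of how human it looks to the detector, and GRPO-style advantages are then computed relative to the sampled group. The penalty value, the function names, and the scores in the example are hypothetical; the actual DEPO formulation may differ, and BERTScore itself is not computed here.

```python
import statistics


def constrained_reward(detector_human_prob, semantic_sim, min_sim=0.85, penalty=-1.0):
    """Hard constraint: below the similarity floor, the detector score is ignored."""
    if semantic_sim < min_sim:
        return penalty
    return detector_human_prob


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four rewrites: (detector "human" probability, similarity).
group = [(0.70, 0.91), (0.40, 0.93), (0.99, 0.60), (0.60, 0.88)]
rewards = [constrained_reward(p, sim) for p, sim in group]
advantages = group_relative_advantages(rewards)
print(rewards)  # [0.7, 0.4, -1.0, 0.6]
```

The third sample has the best detector score but drifts from the source text, so it ends up with the lowest advantage in the group. That is the sense in which padding cannot be rewarded.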

Real-student-essay data

A level-matched, person-preserved dataset (11.7k pairs, no injected errors) replacing the flawed "errors = human" data.

Diverse free-detector ensemble

MAGE / RADAR / Binoculars / Fast-DetectGPT / RoBERTa for cross-detector transfer instead of a single reward.

honest

v4 is specified and its data/scripts are ready, but GRPO training awaits compute. v4-SFT has been trained (clean output, no collapse) but does not yet bypass GPTZero (100% AI); commercial-detector transfer is the explicit, unmeasured open question. (The v3 HuggingFace card’s "98% human" is a stale marketing claim; the paper’s measured figure is ~30%.)

hf · changcheng967/Aegis-Qwen3-Humanizer-v3 · Aegis-Qwen3-8B-Humanizer-v1
Python · TRL (GRPO/SFT/DPO) · PEFT/LoRA · Qwen3-1.7B · bert_score
04

KW Serie — Go AI

KataGo-compatible Go networks trained from scratch via self-play, with a playable CPU engine.

A family of self-trained Go neural networks (the "KW serie") trained end-to-end via self-play, spanning 19×19 across three architecture generations and a dedicated 9×9 pipeline — plus Kata_web, a CPU-runnable web engine anyone can play against online.

Active · OGS integration next · 2025
9×9 CGOS rating: 3206 (third-party-measured · 135 games)
19×19 checkpoints: 21 (across 3 arch generations)
9×9 policy accuracy: 0.987 (at 8.5M self-play samples)
9×9 value accuracy: 0.852 (0.534 → 0.852)
[Chart: 9×9 policy accuracy vs self-play samples (log scale), 0.733 → 0.987.]

key innovations

From-scratch self-play

Not a fine-tune — networks seeded only by KataGo’s bootstrap net and trained end-to-end via self-play.

Three architecture generations

b28c512nbt (291MB) → efficient b18c384nbt (105MB) with aggressive auto-LR to reach competitive strength on far less compute.

Dedicated 9×9 pipeline

A separate training line producing a released final at ~8.5M self-play samples.

Playable CPU web engine

Kata_web pulls the pure-CPU KataGo Eigen build plus the latest KW model and serves a live bot, deployed via GitHub Pages.

honest

The headline measured result is the 9×9 CGOS rating (3206). 19×19 competitive play on OGS is the stated next milestone, not yet live. Training on Ascend NPUs is the author’s account; the 9×9 self-play data was generated on a Colab T4.

C++ (engine) · Python (training) · KataGo · GTP · Plotly · Ascend NPU
05♟️

Luminex

A UCI chess engine in C++23 with a novel Phased Move Generation.

A modern classical chess engine (~8,900 LOC, zero external dependencies) featuring a genuinely novel Phased Move Generation optimization and a fully self-engineered hand-crafted evaluation — every PST, mobility coefficient, and king-safety term derived from chess first principles.

Active · v5.15 · 2025–26
Total LOC: ~8,902 (single-author · zero deps)
Search LOC: 2,527
Evaluation LOC: 1,463
Release binaries: 5 (Linux / Windows / macOS ARM64)

key innovations

Phased Move Generation

Generates moves in priority phases (TT → captures → quiets); in positions where a capture cuts off, the quiet generator is never invoked. Live-instrumented with cutoff stats.
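
The control flow can be illustrated with a lazy generator. This is a Python sketch of the idea only, not the engine's C++ code; the move strings and helper names are made up. Because each phase's generator is called only when the previous phase is exhausted, a cutoff during captures means the quiet generator never runs.

```python
def phased_moves(tt_move, gen_captures, gen_quiets):
    """Yield moves phase by phase; a later generator runs only if it is reached."""
    if tt_move is not None:
        yield tt_move
    for move in gen_captures():
        if move != tt_move:  # the TT move was already tried
            yield move
    for move in gen_quiets():
        if move != tt_move:
            yield move


def first_cutoff(moves, causes_cutoff):
    """Stand-in for the search loop: stop at the first move that cuts off."""
    for move in moves:
        if causes_cutoff(move):
            return move
    return None


calls = {"captures": 0, "quiets": 0}  # instrumentation, as in the description


def gen_captures():
    calls["captures"] += 1
    return ["Qxd5", "Nxe4"]


def gen_quiets():
    calls["quiets"] += 1
    return ["a3", "h3", "Kh1"]


best = first_cutoff(phased_moves(None, gen_captures, gen_quiets),
                    lambda move: move == "Nxe4")
print(best, calls)  # Nxe4 {'captures': 1, 'quiets': 0}
```

If the transposition-table move itself cuts off, neither generator is invoked.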

Full modern search arsenal

PVS, LMR (noisy/quiet tables), null-move, singular + double extension + multi-cut, ProbCut, razoring, aspiration windows, SEE quiescence, lazy SMP, IIR/IID, multi-table correction history.

Self-engineered evaluation

Hand-crafted MG/EG eval with self-derived PSTs, per-level mobility tables, a quadratic king-safety model, and KXK/KNK/KRK mating specials — no eval values borrowed from any engine.

Texel tuner

A 148-parameter direct-optimization tuner with an OpenMP dataset generator.
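
For context, the classic Texel formulation maps a centipawn score to an expected game result with a logistic curve and then adjusts parameters to minimize the squared error against actual results. The sketch below uses a simple coordinate-wise local search on a toy two-parameter linear evaluation. Whether Luminex's tuner uses this exact search is an assumption, and the dataset and names are invented for illustration.

```python
def win_prob(score_cp, K=1.0):
    """Logistic map from a centipawn score to an expected result in [0, 1]."""
    return 1.0 / (1.0 + 10.0 ** (-K * score_cp / 400.0))


def mean_sq_error(params, dataset, evaluate):
    return sum((result - win_prob(evaluate(params, feats))) ** 2
               for feats, result in dataset) / len(dataset)


def local_search(params, dataset, evaluate, step=10, max_passes=50):
    """Nudge one parameter at a time; keep a change only if the error drops."""
    best = list(params)
    best_err = mean_sq_error(best, dataset, evaluate)
    for _ in range(max_passes):
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                err = mean_sq_error(trial, dataset, evaluate)
                if err < best_err:
                    best, best_err, improved = trial, err, True
                    break
        if not improved:
            break
    return best, best_err


def linear_eval(params, feats):
    """Toy evaluation: a weighted sum of features."""
    return sum(p * f for p, f in zip(params, feats))


# Toy data: (features, game result with 1.0 = win, 0.5 = draw, 0.0 = loss).
dataset = [([1, 0], 1.0), ([-1, 0], 0.0), ([0, 1], 0.5)]
start_err = mean_sq_error([0, 0], dataset, linear_eval)
tuned, tuned_err = local_search([0, 0], dataset, linear_eval)
print(tuned, start_err, tuned_err)
```

The first weight grows because its feature correlates with wins, while the second stays at zero because its feature only appears in a drawn position.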

honest

Evaluation values are original; the search layer openly builds on published open-source techniques (Stockfish / Stash / Ethereal), all GPL-attributed in comments — normal practice. Not yet listed on a public rating list (CCRL), so competitive strength is unverified externally.

C++23 · CMake · SIMD (AVX2/SSE4.2/BMI2) · UCI
06🏁

FH6 All-in-One Trainer

A Forza Horizon 6 trainer — serious low-level Windows reverse-engineering.

My most-starred open-source project (172★): a C# trainer combining inline code-cave hooks with live SQL execution against the game’s in-memory database. The most interesting part is the integrity-scanner research, and the honesty of walking back a wrong assumption.

Active · v6.6.1 · 2026
GitHub: 172★ / 20 forks
Releases: 30 (in ~5 weeks)
Cheat surface: ~37 (runtime hooks + SQL features)
C# LOC: ~5,000

key innovations

AOB scanning + x64 code-cave detours

A custom array-of-bytes pattern scanner; inline hooks auto-relocate live register/offset bytes so minor game-build changes are absorbed. ~22 feature signatures.
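
The scanning half of this is a generic technique and can be sketched on an ordinary byte buffer. The pattern syntax (space-separated hex bytes with `??` wildcards) is a common convention and an assumption about the project's format; the bytes in the example are arbitrary and are not one of the trainer's signatures.

```python
def parse_pattern(pattern):
    """'48 8B ?? C3' -> [0x48, 0x8B, None, 0xC3]; None matches any byte."""
    return [None if tok in ("?", "??") else int(tok, 16)
            for tok in pattern.split()]


def aob_scan(buffer, pattern):
    """Return the offset of the first match in `buffer`, or -1 if there is none."""
    signature = parse_pattern(pattern)
    length = len(signature)
    for start in range(len(buffer) - length + 1):
        if all(expected is None or buffer[start + i] == expected
               for i, expected in enumerate(signature)):
            return start
    return -1


# Arbitrary example bytes; the four wildcards stand in for an operand that
# changes between builds while the surrounding opcodes stay the same.
blob = bytes.fromhex("90 48 8B 05 11 22 33 44 C3")
print(aob_scan(blob, "48 8B 05 ?? ?? ?? ?? C3"))  # 1
```

Wildcarding the operand bytes is what lets one signature survive minor rebuilds where only offsets change.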

In-memory SQL execution

Resolves the game’s CDatabase vtable[9] (ExecuteQuery), builds a 34-byte x64 shellcode stub, allocates RWX memory, and fires it via CreateRemoteThread — driving the game’s own SQLite engine.

UWP module discovery

Handles the UWP edge case where .NET’s Process.MainModule throws AccessDenied, via EnumProcessModulesEx.

Intellectual honesty

Ghidra decompilation proved the earlier "anti-cheat bypass" was actually corrupting a flag manager and causing the crashes — so it was removed and the false claims retracted (commit 39c8edc).

honest

Offline-only game modding, against the game’s ToS — dual-use, not security research in the disclosure sense. The current build ships no anti-cheat bypass; it operates through its own code-cave hooks with the game’s integrity system left intact. Pinned to game build v379.939.

C# · .NET 10 · Avalonia UI · Win32 P/Invoke · x64 shellcode


/ skills

The toolkit

What I reach for across research and systems work. Levels reflect demonstrated depth in shipped projects, not self-rating.

Languages

Python 95
C++ 92
C# 85
TypeScript 82
Kotlin 68

AI / ML

PyTorch 90
MindSpore 80
Ascend NPU 78
TRL (GRPO/DPO) 78
Self-play Training 82

Systems & Perf

C++23 / SIMD 88
x64 Reverse Engineering 80
Win32 / P-Invoke 78
Low-level Optimisation 85
GTP / Engine Design 80

Frameworks & Tools

Next.js / React 84
.NET / WPF 82
MCP Protocol 80
CMake 78
Avalonia UI 72


/ contact

Want to talk research, systems, or collaboration?

I read everything. Fastest reach is email.