Student · AI Researcher · Systems Engineer

I build intelligent systems
& high-performance engines

Software engineer specializing in AI/ML research, game engines, and systems programming. Currently training SFM Experiment 1 on Ascend 910 NPUs.

changcheng@desktop
$ whoami
cheng chang — software engineer & ai researcher
$ cat domains.txt
chess engines · ternary llms · go ai · systems programming
$ echo $FAVORITE_MODELS
opus-4.6 glm-5-turbo
$ cat status.txt
training sfm experiment 1 on ascend 910
GitHub Stars
40+
across all repositories
Contributions
2,063
in the last year
Projects
16
shipped projects
Languages
8
programming languages
C++ · Python · TypeScript · C# · PyTorch · MindSpore · Next.js · React · Qt 6 · Ascend NPU · Git · GLM 5 Turbo

About

Driven by curiosity,
powered by code

Cheng Chang
changcheng967 · Student · AI Researcher · Systems Engineer

I'm a student and software engineer with a deep passion for building systems that push boundaries. My work spans from low-level C++ chess engines and Go neural networks to cutting-edge ternary language models and private AI assistants.

I'm the creator of FlashLM (24 GitHub stars), a CPU-native ternary language model that proved ternary weights can match float16 performance. I built the State-Flow Machine, a novel post-transformer architecture that generalizes 30x better than standard transformers on long programs. My chess engines Luminex and Douchess compete at superhuman search depth with bitboard-based evaluation.

I train custom Go networks on KataGo from scratch, built a download manager that hits 213+ MB/s, created an essay grader with 98-99% teacher alignment, and built 4 Chrome extensions used by real users. I develop primarily with Claude Code, powered by my favorite models — Opus 4.6 and GLM 5 Turbo.

Canada · changcheng6541@gmail.com

At a glance

64
Repositories
40
Stars Earned
11
Forks
8
Languages

“Understand the system deeply, then build it to perform flawlessly.”

🧠

AI & Machine Learning

Ternary LLMs (FlashLM, 24★), knowledge distillation, neural network training on Huawei Ascend NPUs, multi-agent systems, post-transformer architectures

♟️

Game Engines

Two chess engines (Luminex in C++23, Douchess in C++17) with bitboard architecture, Go AI with custom KataGo neural networks trained via self-play

Systems Performance

Bolt Download Manager (213+ MB/s peak, 4x faster than IDM), parallel computing, NPU acceleration, C# desktop apps with Fluent Design

🌐

Full-Stack & Extensions

Next.js, TypeScript, Python, 4 Chrome extensions (DouGrammar, Doucite, Doulet AI, Panic Button), AI-powered web platforms, MCP servers

Philosophy

Every law in this field is an assumption waiting to be shattered.

You need float16 weights to build real language models? FlashLM proved that three numbers — negative one, zero, and one — can do the same on a CPU. That assumption died.

Transformers are the ceiling for sequence modeling? State-Flow Machine proved explicit state tracking breaks through by 30x. That assumption died too.

Projects

Selected work

A curated collection of projects spanning AI research, systems programming, game engines, and web development.

🌊Featured Research Project

State-Flow Machine

A novel post-transformer architecture for code intelligence that replaces the single-transformer paradigm with 4 specialized systems. The core insight: coding is about state transformations — what a program does vs what it should do — and explicit state tracking generalizes to longer programs in ways that implicit token-level models provably cannot (TC0 circuit complexity limit, Siems et al. ICLR 2025).

The 4 Systems

System 1: Perception
Linear-attention decoder, O(n) complexity. Reads tokens with causal masking for autoregressive generation. Supports stateful incremental decoding.
System 2: Execution — The Breakthrough
State Slot Bank: 16-64 explicit memory registers that bind to variables and track values through execution. Gated DeltaNet cells with eigenvalues in [-1, 1]; negative eigenvalues enable state tracking that transformers fundamentally cannot perform. Sequential per-chunk writes preserve execution order (attention is parallel; this is not).
System 3: Structure
Dynamic graph neural network. Nodes = functions, classes, variables, files. Edges = calls, imports, mutates, reads, defines, contains. Sparse message-passing gives the model a live dependency map. Supports incremental graph updates.
System 4: Meta
Recurrent controller with hypothesis register (tracks what's wrong), plan stack (intended actions with push/pop), and verification head (checks output before emitting). Prevents death-spirals via gated correction with attempt counting.
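A toy illustration of why the [-1, 1] eigenvalue range matters (my sketch, not SFM code): a linear recurrence with eigenvalue -1 flips its state every step and therefore tracks parity of the step count, something a decay-only recurrence with eigenvalues in [0, 1] cannot do.

```python
def run_recurrence(a: float, steps: int, h0: float = 1.0) -> float:
    """Iterate the scalar linear recurrence h <- a * h."""
    h = h0
    for _ in range(steps):
        h = a * h
    return h

# Eigenvalue -1: the state flips sign each step, encoding step parity.
assert run_recurrence(-1.0, 4) == 1.0   # even step count
assert run_recurrence(-1.0, 5) == -1.0  # odd step count

# Eigenvalue confined to [0, 1]: the state can only decay or hold,
# so odd and even step counts are indistinguishable.
assert run_recurrence(1.0, 4) == run_recurrence(1.0, 5) == 1.0
```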

Cross-System Bridge

Every 2 perception layers, all 4 systems synchronize via projection to a shared 256d space with learned gates. Final output is a learned weighted combination of all system outputs (initialized at 25% each).
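A minimal PyTorch sketch of how such a bridge could look. The class name, per-system dimensions, and sigmoid gating are my assumptions; only the shared 256-d space and the 25%-each initialization come from the description above.

```python
import torch
import torch.nn as nn

class CrossSystemBridge(nn.Module):
    """Illustrative sketch: project each system's output into a shared
    256-d space, gate it, and mix with learned weights initialized so
    every system starts contributing 25%."""
    def __init__(self, system_dims, shared_dim=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, shared_dim) for d in system_dims)
        self.gate = nn.ModuleList(nn.Linear(d, shared_dim) for d in system_dims)
        # Equal mixing weights at init: each system contributes 25%.
        self.mix = nn.Parameter(torch.full((len(system_dims),), 0.25))

    def forward(self, outputs):
        mixed = 0.0
        for w, p, g, x in zip(self.mix, self.proj, self.gate, outputs):
            mixed = mixed + w * torch.sigmoid(g(x)) * p(x)
        return mixed

bridge = CrossSystemBridge([128, 64, 96, 32])
outs = [torch.randn(2, 10, d) for d in (128, 64, 96, 32)]
assert bridge(outs).shape == (2, 10, 256)
```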

Architecture Pipeline

Input Tokens
Perception
Linear Attention · O(n)
Execution
State Slot Bank · Gated DeltaNet
Structure
Dynamic GNN
Meta
Recurrent Controller
Prediction
cross-system bridge · 256d shared space

Experiment 0: Length Generalization

Task: Predict final value of target variable after arithmetic operations. Trained on 10-27 ops, evaluated at up to 32x length.

Length | State Slots | Transformer-Fair | Transformer-Large
1x     | 99.9%       | 100.0%           | 100.0%
2x     | 92.9%       | 99.0%            | 99.5%
4x     | 62.0%       | 1.9%             | 3.1%
8x     | 35.3%       | 1.3%             | 1.0%
16x    | 5.1%        | 0.9%             | 0.7%
32x    | 5.0%        | 1.0%             | 0.8%

SFM retains 62% at 4x length while transformers collapse to ~2% — a 30x generalization gap. The 2.2M Transformer-Large performs no better than 443K Transformer-Fair: this is an architectural limitation, not a scale issue.

Experiment 1: Thinker-1.5B (In Progress)

Fine-tuning Qwen2.5-Coder-1.5B with DeltaNet SFM blocks for code execution reasoning. Simple delta rule: S ← S − β(S·k − v)·kᵀ, 16 heads × 16×16 state, inserted after layers 6, 13, 20, 27.

Multi-loss: masked CE + 0.1×judge_BCE + 0.01×surprise_MSE. Self-evolution via EWMA difficulty adaptation. Synthetic exec() traces + debugging samples generated on-the-fly. Trained on 4× Ascend 910 with MindSpore 2.2 + CANN 7.
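The simple delta rule can be sketched directly in PyTorch. The shapes follow the stated 16 heads × 16×16 state; the function name and the unit-norm key in the demo are illustrative assumptions.

```python
import torch

def delta_rule_update(S, k, v, beta):
    """One step of the delta rule S <- S - beta * (S @ k - v) @ k^T,
    applied per head. Assumed shapes: S (H, d, d), k (H, d), v (H, d)."""
    pred = torch.einsum('hij,hj->hi', S, k)               # current readout S @ k
    err = pred - v                                         # prediction error
    return S - beta * torch.einsum('hi,hj->hij', err, k)   # outer-product write

H, d = 16, 16
S = torch.zeros(H, d, d)
k = torch.randn(H, d)
v = torch.randn(H, d)

# With beta = 1 and a unit-norm key, one update makes S @ k reproduce v exactly.
k = k / k.norm(dim=-1, keepdim=True)
S = delta_rule_update(S, k, v, beta=1.0)
readout = torch.einsum('hij,hj->hi', S, k)
assert torch.allclose(readout, v, atol=1e-5)
```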

PyTorch · MindSpore · Ascend 910 · DeltaNet · GNN · State Slots · Cube Optimization · MIT License
🔥
24

FlashLM

CPU-Native Ternary Language Models — proving that {-1, 0, +1} weights can match float16 performance. v7 "Eclipse" is a 124M parameter BitNet b1.58 transformer trained on FineWeb-Edu, featuring ARM NEON/OpenMP kernels and Ascend NPU acceleration. Featured on Reddit r/LocalLLaMA.

PyTorch · C · Ascend NPU · NEON
♟️

Luminex

World-class classical chess engine in C++23. Features LMR, null move pruning, singular extension, aspiration windows, SEE-based quiescence search, and comprehensive evaluation with PST, mobility, pawn structure, and king safety. ~7000 LOC, ~280KB binary.

C++23 · Bitboard · UCI
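The bitboard idea behind such engines fits in a few lines. A minimal Python sketch (the engine itself is C++23, not this code): knight attacks computed with 64-bit shifts and file masks.

```python
FILE_A = 0x0101010101010101
FILE_H = FILE_A << 7
MASK64 = (1 << 64) - 1

def knight_attacks(sq: int) -> int:
    """Attack bitboard for a knight on square sq (0 = a1 ... 63 = h8).
    File masks stop moves from wrapping around the board edge."""
    b = 1 << sq
    not_a  = ~FILE_A & MASK64                     # landing square not on a-file
    not_ab = ~(FILE_A | (FILE_A << 1)) & MASK64   # not on a- or b-file
    not_h  = ~FILE_H & MASK64                     # not on h-file
    not_gh = ~(FILE_H | (FILE_H >> 1)) & MASK64   # not on g- or h-file
    return (((b << 17) & not_a)  | ((b << 15) & not_h)
          | ((b << 10) & not_ab) | ((b << 6)  & not_gh)
          | ((b >> 6)  & not_ab) | ((b >> 10) & not_gh)
          | ((b >> 15) & not_a)  | ((b >> 17) & not_h)) & MASK64

# A knight on d4 (square 27) attacks 8 squares; cornered on a1, only 2.
assert bin(knight_attacks(27)).count('1') == 8
assert bin(knight_attacks(0)).count('1') == 2
```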
🔮

NEXUS v2

7-layer private AI assistant: Soul (immutable constitution), Observer (Windows UI automation via pywinauto), Encoder (Transformer), Memory (FAISS + SQLite), Resonator (retrieval-reasoning), Decoder (GRU), Agent (trust levels). Self-evolves through daily training on Ascend 910 NPUs.

Python · PyTorch · FAISS · FastAPI
🤖

Nano-Coder

Building the world's best 8B coding agent through knowledge distillation. Fine-tunes Qwen3-8B with LoRA on coding trajectories (SWE-bench, CoderForge) using 4-NPU pipeline parallelism on Ascend 910ProA, with custom FlashAttention patch.

Python · MindSpore · LoRA · Ascend 910ProA
1

Douchess

High-performance C++17 chess engine using 64-bit bitboard architecture. Robust search algorithm with deeply tuned handcrafted evaluation. Represents the "Classical" development era with 16M-node transposition table and 0-950ms variable search time.

C++17 · Bitboard · Zobrist
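Zobrist hashing, which a transposition table like this relies on, also fits in a few lines. An illustrative Python sketch, not Douchess source:

```python
import random

random.seed(0)
# One random 64-bit key per (piece, square): 12 piece types x 64 squares.
ZOBRIST = [[random.getrandbits(64) for _ in range(64)] for _ in range(12)]

def toggle(h: int, piece: int, square: int) -> int:
    """XOR a piece in or out of the hash. XOR is its own inverse, so the
    same call both places and removes a piece, making moves O(1) to hash."""
    return h ^ ZOBRIST[piece][square]

h = 0
h = toggle(h, piece=0, square=12)   # place a piece
h = toggle(h, piece=0, square=12)   # remove it again
assert h == 0                        # hash returns to the empty-board value

# A move = XOR out the origin, XOR in the destination; order is irrelevant.
h1 = toggle(toggle(0, 0, 12), 0, 20)
h2 = toggle(toggle(0, 0, 20), 0, 12)
assert h1 == h2
```

The resulting 64-bit hash indexes the transposition table (e.g. `h % table_size` for a 16M-entry table).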

AscendGo

GTP engine for the game of Go. Custom KataGo-compatible neural network weights (KW/KW9x9) trained from scratch through extensive self-play, achieving competitive performance on both 9x9 and 19x19 boards. 21 model checkpoints for 19x19.

C++ · PyTorch · KataGo · Self-Play

Bolt Download Manager

High-performance download accelerator in C++23 with dynamic segmentation (16-32 segments), work stealing, and stalled segment recovery. HTTP/2 support, Windows async I/O. Peaks at 213+ MB/s — 4x faster than IDM. Qt6 GUI + CLI + browser extension.

C++23 · Qt 6 · libcurl · Boost.Asio
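The segmentation step can be illustrated with a small range-splitting helper. A hedged Python sketch of the idea only; Bolt's actual segmentation is dynamic C++23 with work stealing.

```python
def split_ranges(content_length: int, segments: int):
    """Split a download of content_length bytes into contiguous
    (start, end) byte ranges, end inclusive, as used with HTTP
    'Range: bytes=start-end' requests."""
    base, extra = divmod(content_length, segments)
    ranges, start = [], 0
    for i in range(segments):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges

r = split_ranges(1000, 3)
assert r == [(0, 333), (334, 666), (667, 999)]
assert r[-1][1] == 999  # ranges tile the file exactly
```

Each range can then be fetched concurrently; idle workers steal unfinished ranges from stalled ones.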
📡

RepoBeam

LLM-friendly code browsing platform. Indexes GitHub repos, extracts symbols using LLM (Qwen2.5-Coder-32B via NVIDIA NIM), and serves code in token-efficient chunks with symbol navigation and code search. Designed for AI agents.

Next.js · TypeScript · Supabase · NVIDIA NIM
📊

TeachAssist Desktop

Modern Windows app for YRDSB students. Fluent Design UI with Mica backdrop, grade trends visualization (ScottPlot), What-If calculator, grade goals, CSV/HTML export, course code decoding, and school name extraction. Built with WPF and .NET 10.

C# · .NET 10 · WPF · ScottPlot
🎧

WinPods

Brings the native AirPods experience to Windows with iOS 26 Liquid Glass UI. BLE battery monitoring, auto-connect on case open, media controls (play/pause from system tray), low battery alerts, and system tray integration.

C# · WinUI 3 · BLE · Windows App SDK
✏️

DouGrammar

Advanced grammar checking Chrome extension with real-time checking, spelling correction, style suggestions, and readability analysis. Supports 15+ languages and multiple AI providers (DeepSeek, OpenAI, Google, Anthropic, Qwen).

JavaScript · Chrome Extension · Multi-AI · 15+ Languages
📝

Doucite

One-click citation generator Chrome extension. Layered metadata extraction (citation/DC meta, JSON-LD, OG tags, visible text, regex) with APA 7, MLA 9, Chicago formatting and BibTeX export. Special handling for government/academic sites.

JavaScript · Chrome Extension · APA/MLA/Chicago · BibTeX
💡

Doulet AI Assistant

Browser extension providing AI-powered answers on any webpage. Highlight text and get comprehensive educational responses. Uses NVIDIA NIM as primary API with OpenRouter fallback. 10+ free models, custom prompts, multi-language.

JavaScript · Chrome Extension · NVIDIA NIM · OpenRouter
🔮

NEGAA

Next.js application platform using NVIDIA NIM APIs for AI-powered capabilities. Modern web stack with TypeScript and ESLint integration.

Next.js · TypeScript · NVIDIA NIM
🔍

free_web_tools

Zero-cost MCP server providing web search and content extraction. 4 tools: web_search, fetch_url, news_search, related_searches. Uses SearXNG public instances and Jina AI Reader. No API keys needed.

Python · MCP Protocol · httpx · Zero-cost
🚨

Panic Button

Browser extension that instantly hides all tabs except saved "important" ones. Save important websites, close all others with "PANIC" mode, keyboard shortcuts (Ctrl+Shift+X), context menu integration.

JavaScript · Chrome Extension · Keyboard Shortcuts

Under the Hood

Code that speaks

Real code from real projects. Every snippet is production code that shipped — not pseudocode, not examples.

FlashLM · flashlm/quantize.py

The core insight: replace float16 weights with {-1, 0, +1} — enabling CPU-native inference at extreme speed.

flashlm/quantize.py · Python
from torch import Tensor

def quantize_ternary(weight: Tensor) -> Tensor:
    """BitNet b1.58 quantization: W -> {-1, 0, +1},
    stored pre-scaled as {-s, 0, +s} with s = mean(|W|)."""
    scale = weight.abs().mean()
    quantized = weight.clone()
    quantized[weight.abs() < 0.667 * scale] = 0
    quantized[weight > 0.667 * scale] = scale
    quantized[weight < -0.667 * scale] = -scale
    return quantized

Philosophy

No law is unbreakable

Every law in this field is an assumption waiting to be shattered. Here are four assumptions that I found, tested, and destroyed — each one backed by a shipped project.

“Every law in this field is an assumption waiting to be shattered. The question is never whether it can be broken — it is whether you have the audacity to try.”
Cheng Chang
AI Researcher & Systems Engineer
01

Question every assumption

They said you need float16 weights to build real language models.

FlashLM proved that three numbers — negative one, zero, and one — can achieve meaningful language modeling on a CPU. 24 GitHub stars, featured twice on Reddit r/LocalLLaMA. That assumption died with a 124M parameter model running on consumer hardware.

FlashLM
02

Find the architectural ceiling

They said transformers are the ceiling for sequence modeling — longer programs are architecturally impossible to generalize.

State-Flow Machine proved that explicit state tracking breaks through by 30x. At 4x training length: SFM achieves 62% while transformers collapse to ~2%. A 2.2M Transformer-Large performs no better than a 443K one — this is not a scale issue. It is an architectural wall.

State-Flow Machine
03

Build what they say cannot be built

A download manager that beats IDM? A Go AI trained from scratch? Private AI without cloud APIs?

Bolt DM hits 213+ MB/s — 4x faster than the industry standard. AscendGo trains neural networks from zero via self-play. NEXUS v2 runs 7 AI layers entirely on local Ascend NPUs, falling back to cloud only when absolutely necessary. Every one of these was "impossible" until it shipped.

Systems
04

The frontier is defined by what we question

Every project I build is an attempt to find the next assumption that is wrong.

Nano-Coder pushes knowledge distillation to its limits on Ascend 910ProA with 4-NPU pipeline parallelism. SFM Thinker-1.5B extends explicit state reasoning to full code synthesis with DeltaNet gates. FlashLM v8 "Nova" explores hybrid ternary-binary quantization. The frontier of intelligence is not defined by what we know is possible — it is defined by what we are willing to test, question, and ultimately destroy.

Research

The next generation of AI systems I build will continue to find where the current paradigm breaks — and build something that does not. This is not arrogance. This is the scientific method applied to engineering.

Recognition

Achievements

Featured releases, milestones, and community recognition across AI research, systems programming, and open source.

🌟
FeaturedJan 2025

FlashLM Featured on Reddit r/LocalLLaMA

Featured twice on one of the largest local AI communities. Thousands of views and discussions about ternary language models running on CPU hardware.

🔬
MilestoneJun 2025

SFM 30x Generalization Breakthrough

State-Flow Machine achieved 62% accuracy at 4x length vs ~2% for transformers. Published experiment results confirming architectural advantage over scale.

MilestoneMar 2025

Bolt DM: 213+ MB/s Download Speed

Built a C++23 download accelerator that peaks at 4x faster than Internet Download Manager. Dynamic segmentation with work stealing and stalled recovery.

🔥
FeaturedOngoing

FlashLM 24 GitHub Stars

Open-source ternary language model repository earned 24 stars from the AI/ML community. Proof that the ternary paradigm resonates with researchers.

♟️
Release2024-2025

Two Chess Engines Shipped

Luminex (C++23, ~7000 LOC) and Douchess (C++17) — both using bitboard architecture with superhuman search depth. Luminex features LMR, null move pruning, singular extension.

🧩
ReleaseFeb 2025

4 Chrome Extensions Published

DouGrammar, Doucite, Doulet AI Assistant, and Panic Button — shipped and used by real users for grammar checking, citations, AI answers, and tab management.

📝
MilestoneMar 2025

DouEssay: 98-99% Teacher Alignment

AI Writing Mentor that grades essays using Ontario curriculum rubrics with near-perfect alignment to teacher scoring. Real-time feedback and visual analytics.

🔮
ReleaseSep 2025

NEXUS v2: 7-Layer Private AI

Built a private AI assistant with Soul, Observer, Encoder, Memory, Resonator, Decoder, and Agent layers. Self-evolves through daily training on Ascend 910 NPUs.

MilestoneDec 2024

Go Neural Networks Trained from Scratch

Trained custom KataGo-compatible neural network weights (KW/KW9x9) from scratch via self-play on Huawei Ascend NPUs. 21 model checkpoints for 19x19.

“The best projects are the ones people tell you not to build. They said ternary models cannot work. They said transformers are the ceiling. They said a student cannot build engines that compete with decades-old projects. I built them anyway.”

Now

What I'm working on

Active projects and research directions — pushing boundaries across AI, game engines, and autonomous agents.

SFM Experiment 1 Training

Fine-tuning Qwen2.5-Coder-1.5B with DeltaNet SFM blocks for code execution reasoning — the second major experiment validating the State-Flow Machine architecture.

Training

Training on 4x Ascend 910 NPUs with the full SFM architecture. Multi-loss: masked CE + 0.1x judge_BCE + 0.01x surprise_MSE. Self-evolution via EWMA difficulty adaptation. Synthetic exec() traces and debugging samples generated on-the-fly.

Target: demonstrate SFM code reasoning at scale
MindSpore 2.2 · Ascend 910 · DeltaNet · CANN 7 · 25%

AscendGo

Bug fixing, benchmarking visits-per-second search performance, and testing on OGS (Online Go Server) for real-world competitive play validation.

Testing

Fixing critical search and evaluation bugs found during self-play testing. Benchmarking how many visits per second the search can achieve on the machine to measure raw performance. Preparing for OGS integration via GTP protocol for live testing against human players.

Target: competitive play on OGS
C++ · PyTorch · KataGo · GTP Protocol · OGS · 60%

NEXUS v2

Making the private AI assistant better — improving layer architecture, response quality, and getting closer to OpenClaw-level autonomous agent capabilities.

Building

Refining the 7-layer architecture (Soul, Observer, Encoder, Memory, Resonator, Decoder, Agent) for better tool use and reasoning. Implementing OpenClaw-inspired autonomous action patterns — multi-step task execution, file system navigation, and self-healing error recovery. Daily self-evolution training on Ascend NPUs.

Target: approach OpenClaw agent capabilities
Python · PyTorch · FAISS · FastAPI · Ascend 910 · 45%
🔮

Future Vision

The next frontier: SFMs that reason — extending the State-Flow Machine from arithmetic to full code synthesis, debugging, and self-improvement. The 4-system architecture is not just a research artifact; it is a blueprint for how AI should think.

Ternary models at scale — proving that FlashLM's ternary weight paradigm extends beyond 124M. If a model can think in three states, it can think in any state. The question was never "can ternary work?" — it was "how far can we push it?"

Autonomous agents on consumer hardware — NEXUS approaching OpenClaw-level capabilities with full Ascend NPU acceleration, self-evolving daily, requiring zero cloud APIs. Private AI that runs entirely on your desk.

Open Source

Built in the open

Every project is open-source and publicly available. Research, code, and training configurations — nothing hidden behind paywalls.

Open by default

Every major project ships on GitHub with MIT or permissive licensing. Research, code, and training scripts are public.

Ship early, iterate often

FlashLM went through 8 versions. Douchess v1 evolved into Luminex. Each release is better than the last because of rapid iteration.

Document everything

Comprehensive READMEs with architecture explanations, experiment results, and reproducible training configurations.

Build in public

Community feedback from Reddit, GitHub issues, and discussions directly shapes project direction and priorities.

Featured Repositories

GitHub

Activity at a glance

Live from GitHub API

Skills

Tech stack

Languages, frameworks, and platforms I work with across my projects.

Skill Profile

20 skills across 4 domains

🖥️
Languages

C++ / C
Python
TypeScript / JavaScript
C# / .NET
HTML / CSS

🧠
AI / ML

PyTorch
Huawei Ascend / MindSpore
Neural Architecture Design
Knowledge Distillation
Transformer / DeltaNet

🏗️
Frameworks & Tools

Next.js / React
Qt 6
WPF / WinUI 3
Chrome Extensions API
CMake / Ninja

🎯
Specializations

Game Engines (Chess / Go)
Bitboard Architecture
Systems Performance
NPU / GPU Acceleration
Circuit Complexity Theory

Development

How I build

My tools, models, and workflow for turning ideas into working systems.

Claude Code

Primary development tool — AI-powered terminal coding agent

🧠

Opus 4.6

Favorite model for complex reasoning, architecture design, and code generation

GLM 5 Turbo

Go-to model for fast, high-quality development and problem solving

🔥

Huawei Ascend 910

NPU cluster for training AI models — the hardware that powers FlashLM, SFM, and Nano-Coder

📦

Git & GitHub

64 repositories, version control, open-source contributions

🪟

Windows 11

Primary development environment with WSL2, WPF/WinUI apps, and system-level programming

Workflow

1

Research problem deeply before writing code

2

Prototype with Claude Code (Opus 4.6 / GLM 5 Turbo)

3

Train and iterate on Ascend 910 NPUs

4

Test rigorously, optimize performance

5

Ship and open-source when ready

Contact

Let's build something
extraordinary together

Interested in AI research collaboration, discussing architectures, contributing to open source, or just want to chat? I'm always open to interesting conversations.

What I'm open to

🔬

AI Research Collaboration

Novel architectures, post-transformer systems, ternary models, knowledge distillation

♟️

Game Engine Development

Chess, Go, or board game AI — neural network training, search algorithms, evaluation

🌐

Open Source Projects

Contributing to or building developer tools, AI infrastructure, and community projects

💬

Technical Discussions

System design, performance optimization, NPU/GPU computing, and architecture decisions

Currently available for collaboration
Response time: usually within 24h