Student · AI Researcher · Systems Engineer

I build intelligent systems
& high-performance engines

Software engineer specializing in AI/ML research, game engines, and systems programming. Currently building FlashLM v10 with BitLinear attention.

changcheng@desktop
$ whoami
cheng chang — software engineer & ai researcher
$ cat domains.txt
chess engines · ternary llms · go ai · systems programming
$ echo $FAVORITE_MODELS
opus-4.7 glm-5.1
$ cat status.txt
building flashlm v10 with bitlinear attention
GitHub Stars
27+
across all repositories
Contributions
2,063
in the last year
Repositories
64
public repositories
Languages
8+
programming languages
C++ · Python · TypeScript · C# · Kotlin · PyTorch · MindSpore · Next.js · WinUI 3 · Ascend NPU · KataGo · GLM-5.1

About

Driven by curiosity,
powered by code

Cheng Chang
changcheng967 · Student · AI Researcher · Systems Engineer

I'm a student and software engineer with a deep passion for building systems that push boundaries. My work spans from low-level C++ chess engines and Go neural networks to cutting-edge ternary language models and kernel drivers.

I'm the creator of FlashLM (27 GitHub stars), a CPU-native ternary language model spanning 162 commits and 8 phases — proving that Gated DeltaNet achieves 3.54x lower perplexity than a transformer baseline on identical data. My chess engine Luminex introduces novel Phased Move Generation with a fully self-engineered evaluation in ~8,050 lines of C++23. I train custom KataGo neural networks from scratch via self-play on Ascend NPUs.

I built WinPods — bringing the native AirPods experience to Windows with a custom KMDF kernel driver for noise control, TeachAssist for both Desktop and Android, and open-source MCP tooling. I develop primarily with Claude Code, powered by my favorite models — Opus 4.7 and GLM-5.1.

Canada · changcheng6541@gmail.com

At a glance

64
Repositories
27
Stars Earned
11
Forks
8
Languages

“Understand the system deeply, then build it to perform flawlessly.”

🧠

AI & Machine Learning

Ternary LLMs (FlashLM, 27★), custom KataGo neural networks trained via self-play on Ascend NPUs, post-transformer architectures (Gated DeltaNet), knowledge distillation

♟️

Game Engines

Luminex — C++23 chess engine with novel Phased Move Generation and fully self-engineered evaluation. ~8,050 LOC, zero dependencies, cross-platform

Desktop Applications

TeachAssist Desktop (C#/.NET 10), TeachAssist Android (Kotlin/Compose), WinPods (AirPods for Windows with custom KMDF kernel driver), WPF/WinUI 3

🌐

Tools & Infrastructure

MCP servers, AI-powered web platforms, Chrome extensions, LLM-friendly code browsing (RepoBeam), NPU-accelerated training pipelines

Philosophy

Every law in this field is an assumption waiting to be shattered.

You need float16 weights to build real language models? FlashLM proved that three numbers — negative one, zero, and one — can do the same on a CPU. That assumption died.

Transformers are the ceiling for sequence modeling? Gated DeltaNet proved that targeted correction memory achieves 3.54x lower PPL at identical scale. That assumption died too.

Projects

Selected work

A curated collection of projects spanning AI research, systems programming, game engines, and desktop applications.

🔥 Featured Project · ★ 27

FlashLM

CPU-native language models trained entirely from scratch — no GPUs, no pretraining. Exploring ternary quantization, Gated DeltaNet, and test-time search to push the limits of what small models can achieve on free-tier hardware.

162 commits across 8 development phases. 7 models on HuggingFace. Every experiment documented — including all failures.

Ternary Weights

Weights constrained to {-1, 0, +1}. Proved 1.58-bit quantization converges at small scale.

🧠Gated DeltaNet

Delta rule memory (M += β(v - M·k)⊗k) performs targeted correction. 3.54x lower PPL than the transformer baseline.
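The delta rule update can be sketched in a few lines of NumPy. The shapes, names, and unit-norm key below are illustrative assumptions for the sketch, not FlashLM's actual implementation:

```python
import numpy as np

def delta_rule_write(M, k, v, beta=1.0):
    """One memory write: M += beta * (v - M @ k) outer k.
    The memory is corrected only where its recall for key k is wrong."""
    pred = M @ k                        # what the memory currently recalls for k
    return M + beta * np.outer(v - pred, k)

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 4))         # memory matrix
k = rng.standard_normal(4)
k /= np.linalg.norm(k)                  # unit-norm key
v = rng.standard_normal(8)              # value to store

M = delta_rule_write(M, k, v)
# with beta = 1 and a unit-norm key, recall is now exact: M @ k == v
assert np.allclose(M @ k, v)
```

This is the "targeted" part: instead of overwriting the whole memory, the write is proportional to the recall error v - M·k, so keys the memory already handles correctly are left untouched.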

🔮STMM (v9.4)

GRU + VQ-VAE codebook for explicit entity state tracking across sentence boundaries. Latest innovation.

🏆CORTEX-VIII

Best PPL on free-tier CPU: 2.33 with 6.6M params in 2h. 3.54x lower than the transformer baseline.

🎯SearchLM

AlphaGo-inspired test-time compute. Value heads genuinely learned (V_Corr +0.66), but search doesn't fix coherence.

📝Honest Research

Every failure documented. Reckoning at PPL 130, RWKV at 377, Story Compass at 17.56. No cherry-picking.

BitLinear Attention (v10)

Lean ternary attention architecture achieving 11k tok/s on CPU with pure BitLinear projections — 3x speedup over standard linear layers.
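A minimal sketch of what a BitLinear-style projection could look like, assuming absmean scaling and a 0.667 threshold; the function name and structure are mine, not FlashLM's kernel:

```python
import numpy as np

def bitlinear(x, W):
    """BitLinear-style projection sketch: quantize W to {-1, 0, +1} with
    an absmean scale, matmul with the ternary matrix, rescale the output."""
    scale = np.abs(W).mean()
    Wq = np.sign(W) * (np.abs(W) >= 0.667 * scale)   # ternary weight matrix
    return (x @ Wq.T) * scale
```

Because every weight is -1, 0, or +1, the matmul reduces to additions and subtractions, which is where the claimed speedup over standard float linear layers on CPU would come from.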

PPL Evolution

v5: 1.36 · v7.4: 2.33 · v8: 2.40 · v8.3: 2.50 · v10: pending
PyTorch · C · HuggingFace · Ternary Weights · Gated DeltaNet · CORTEX-VIII · STMM · BitLinear Attn · MIT License
♟️ Featured Project

Luminex

A world-class UCI chess engine written in modern C++23. Features a novel Phased Move Generation optimization and a fully self-engineered hand-crafted evaluation — every PST value, mobility coefficient, and king safety parameter derived from chess first principles.

~8,050 lines of code. Zero external dependencies. Cross-platform: Linux, Windows, macOS (Apple Silicon native).

Phased Move Generation

Novel optimization: generates moves in priority phases (TT → captures → quiets). ~70-80% of positions cut off before quiet gen is ever invoked.
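The phase idea can be sketched with a lazy generator: quiet generation sits behind a callable, so a cutoff on the TT move or a capture means it never runs. Move strings and helper names here are simplified stand-ins, not Luminex's C++:

```python
def phased_moves(tt_move, captures, gen_quiets):
    """Yield moves phase by phase; quiet generation is deferred behind a
    callable so it never runs if the search cuts off earlier."""
    if tt_move is not None:
        yield tt_move                          # phase 1: transposition-table move
    for m in captures:                         # phase 2: captures (pre-ordered)
        if m != tt_move:
            yield m
    for m in gen_quiets():                     # phase 3: quiets, generated lazily
        if m != tt_move:
            yield m

quiet_calls = []
def gen_quiets():
    quiet_calls.append(1)                      # record that quiet gen actually ran
    return ["Nf3", "g3"]

# A beta cutoff on the TT move stops iteration before phase 3:
it = phased_moves("e4e5", ["d4e5"], gen_quiets)
first = next(it)                               # search cuts off here
assert first == "e4e5" and quiet_calls == []   # quiets were never generated
```

If the search does exhaust the earlier phases, the generator falls through and quiets are produced as usual, so correctness is unchanged; only the common case gets cheaper.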

Self-Engineered Evaluation

1,402 lines of hand-crafted evaluation — no values borrowed from PeSTO, Ethereal, or any other engine. All weights derived from chess first principles.

Comprehensive Search

PVS with LMR, null move pruning, singular extension, probCut, razoring, quiescence with SEE, and mate distance pruning. 2,265 lines of search logic.

Cross-Platform

5 pre-built binaries (Linux AVX2/SSE, Windows ClangCL/MSVC, macOS ARM64). CMake with auto-detected SIMD. ~8,050 LOC total, zero dependencies.

~8,050
Lines of Code
2,265
Search Lines
1,402
Eval Lines
3
Platforms

Search & Evaluation Arsenal

PVS · LMR · Null Move · Singular Extension · Aspiration Windows · SEE Quiescence · ProbCut · Razoring · IIR/IID · Lazy SMP · PST Evaluation · Mobility · Pawn Structure · King Safety (Sigmoid) · Threats · Endgame Knowledge · Correction History
C++23 · CMake · Bitboard · Magic Bitboards · UCI Protocol · AVX2/SSE · Non-Commercial License

AscendGo

Go AI with custom KataGo-compatible neural networks trained from scratch via self-play. 21 checkpoints for 19x19 across 3 architectural generations (18b/384ch, 28b/512ch, refined), plus a dedicated 9x9 pipeline with a released model at 8.5M self-play steps. Deployed via custom C++ GTP engine for OGS and CGOS competitive play.

C++ · PyTorch · KataGo · Ascend NPU · GTP
📊

TeachAssist Desktop

Polished Windows 11 desktop app for YRDSB students. Bento-grid dashboard, grade trend charts, What-If calculator, grade goals, CSV/HTML export, Ontario course code decoding, and auto-login via Windows Credential Manager. v5.0 with fully custom components.

C# · .NET 10 · WPF · ScottPlot
📱

TeachAssist Android

Material Design 3 Android companion app. Spring-animated Grade Ring, background grade-change notifications via WorkManager, biometric login, What-If calculator, confetti celebrations, AMOLED/Dynamic Color theming, and offline disk caching. v2.9.0 targeting SDK 35.

KotlinJetpack ComposeMaterial 3Hilt
🎧

WinPods

Brings the native AirPods experience to Windows. iOS-style translucent battery popup, auto-connect on case open, media controls, ear detection, and full noise control (ANC/Transparency/Adaptive) via a custom KMDF kernel driver for L2CAP access. Supports all AirPods and Beats with W1/H1/H2 chips.

C# · WinUI 3 · KMDF Driver · BLE
🔍

free_web_tools

Comprehensive MCP server with 14 tools for web search, deep research, GitHub integration, code search, and package lookups. Multi-backend search (DDG + Mojeek + Bing + Startpage), content extraction, Wikipedia, and auto-answer synthesis. Zero API keys needed. v5.0.

Python · MCP Protocol · GitHub API · Multi-Backend

Under the Hood

Code that speaks

Real code from real projects. Every snippet is production code that shipped — not pseudocode, not examples.

FlashLM · flashlm/quantize.py

The core insight: replace float16 weights with {-1, 0, +1} — enabling CPU-native inference at extreme speed.

def quantize_ternary(weight: Tensor) -> Tensor:
    """BitNet b1.58 quantization: W -> {-1, 0, +1}"""
    scale = weight.abs().mean()
    quantized = weight.clone()
    quantized[weight.abs() < 0.667 * scale] = 0
    quantized[weight > 0.667 * scale] = scale
    quantized[weight < -0.667 * scale] = -scale
    return quantized
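As a quick sanity check, the same thresholding logic can be re-created in NumPy (a re-implementation for illustration, not the shipped PyTorch code) to confirm the output only ever contains -scale, 0, and +scale:

```python
import numpy as np

def quantize_ternary_np(w):
    """NumPy re-creation of quantize_ternary above, for a sanity check."""
    scale = np.abs(w).mean()
    q = w.copy()
    q[np.abs(w) < 0.667 * scale] = 0.0
    q[w > 0.667 * scale] = scale
    q[w < -0.667 * scale] = -scale
    return q

w = np.random.default_rng(2).standard_normal((64, 64))
q = quantize_ternary_np(w)
scale = np.abs(w).mean()
levels = {float(u) for u in np.unique(np.round(q / scale, 6))}
assert levels <= {-1.0, 0.0, 1.0}   # only three effective weight values survive
```

Dividing out the per-tensor scale leaves exactly the ternary alphabet, which is what makes the subsequent matmuls CPU-friendly.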

Skills

What I work with

Languages, frameworks, and platforms I use across my projects — from low-level systems to large language models.

💻

Languages

Python · C++ · TypeScript · C# · Kotlin
🧠

AI / ML

PyTorch · MindSpore · Ascend NPU · KataGo · Transformers

Frameworks

Next.js · React · .NET / WPF · WinUI 3 · Tailwind CSS
🎯

Specializations

LLM Training · Game Engines · Systems Perf · Chrome Ext · Kernel Drivers

Technology Map

Tech Constellation


C++ · Python · TypeScript · C# · PyTorch · MindSpore · Next.js · Ascend NPU · KataGo · WinUI 3 · Kotlin · GLM-5.1

Now

What I'm working on

Active projects and research directions — pushing boundaries across AI, game engines, and autonomous agents.

FlashLM v10

Developing BitLinear attention architecture for CPU-native ternary LMs. v10 achieves 11k tok/s with pure BitLinear projections.

Building

Stripped to lean BitLinear attention with d=256, L=4, H=4, ~3.9M params. All projections ternary. Standard causal attention replacing Gated DeltaNet for simplicity and speed. Training on TinyStories V2-GPT4 full train split.

Target: coherent generation at 4M params
PyTorch · BitLinear · Causal Attention · TinyStories (40% complete)

AscendGo

Bug fixing, benchmarking visits-per-second search performance, and testing on OGS for real-world competitive play with custom KataGo neural networks.

Testing

21 trained checkpoints for 19x19 across 3 architectural generations. Dedicated 9x9 pipeline with released model at 8.5M self-play steps. Fixing critical search and evaluation bugs, benchmarking visits-per-second throughput, and preparing for OGS integration via the GTP protocol.

Target: competitive play on OGS
C++ · PyTorch · KataGo · GTP Protocol · OGS (65% complete)

Nano-Coder

Knowledge distillation pipeline for 8B coding agent on Ascend 910ProA with 4-NPU pipeline parallelism.

Research

Building NC-1 Preview with pipeline validation on OpenI. Architecture based on SFM (State-Flow Machine) with delta rule memory for code execution reasoning. Runtime FlashAttention patch for unsupported CANN kernels. Targeting best 8B coding agent.

Target: NC-1 Preview pipeline validation
MindSpore · Ascend 910 · LLaMA-Factory · FlashAttention (20% complete)
🔮

Future Vision

FlashLM v10 — bridging the gap between speed and coherence at 4M params. Pure ternary BitLinear attention hitting 11k tok/s on CPU, pushing toward coherent generation at a fraction of conventional model sizes.

Nano-Coder — state-based reasoning for code intelligence. An 8B coding agent built on SFM with delta rule memory, distilling code execution reasoning through 4-NPU pipeline parallelism on Ascend hardware.

AscendGo — competitive Go AI on OGS. 21 trained neural network checkpoints for 19x19 and a released 9x9 model at 8.5M self-play steps, all trained on Ascend NPUs. AscendGo is approaching readiness for live OGS play against human opponents.

Recognition

Achievements

Featured releases, milestones, and community recognition across AI research, systems programming, and open source.

🌟
Featured · Jan 2025

FlashLM Featured on Reddit r/LocalLLaMA

Featured twice on one of the largest local AI communities. Thousands of views and discussions about ternary language models running on CPU hardware.

🤖
Milestone · 2026

Nano-Coder: 4-NPU Pipeline Parallelism

Knowledge distillation pipeline for 8B coding agent on Ascend 910ProA with runtime FlashAttention patch for unsupported CANN kernels. Delta rule memory for code execution reasoning.

Milestone · Mar 2025

Bolt DM: 213+ MB/s Download Speed

Built a C++23 download accelerator that peaks at 4x faster than Internet Download Manager. Dynamic segmentation with work stealing and stalled recovery.

🔥
Featured · Ongoing

FlashLM 27 GitHub Stars

Open-source ternary language model repository earned 27 stars from the AI/ML community. Proof that the ternary paradigm resonates with researchers.

♟️
Release · 2024–2025

Two Chess Engines Shipped

Luminex (C++23, ~8,050 LOC) and Douchess (C++17) — both using bitboard architecture with superhuman search depth. Luminex features novel Phased Move Generation and self-engineered evaluation.

🧩
Release · Feb 2025

4 Chrome Extensions Published

DouGrammar, Doucite, Doulet AI Assistant, and Panic Button — shipped and used by real users for grammar checking, citations, AI answers, and tab management.

📝
Milestone · Mar 2025

DouEssay: 98-99% Teacher Alignment

AI Writing Mentor that grades essays using Ontario curriculum rubrics with near-perfect alignment to teacher scoring. Real-time feedback and visual analytics.

🔮
Release · Sep 2025

NEXUS v2: 7-Layer Private AI

Built a private AI assistant with Soul, Observer, Encoder, Memory, Resonator, Decoder, and Agent layers. Self-evolves through daily training on Ascend 910 NPUs.

Milestone · Dec 2024

Go Neural Networks Trained from Scratch

Trained custom KataGo-compatible neural network weights (KW/KW9x9) from scratch via self-play on Huawei Ascend NPUs. 21 model checkpoints for 19x19.

“The best projects are the ones people tell you not to build. They said ternary models cannot work. They said transformers are the ceiling. They said a student cannot build engines that compete with decades-old projects. I built them anyway.”

Philosophy

No law is unbreakable

Every law in this field is an assumption waiting to be shattered. Here are four assumptions that I found, tested, and destroyed — each one backed by a shipped project.

“Every law in this field is an assumption waiting to be shattered. The question is never whether it can be broken — it is whether you have the audacity to try.”
Cheng Chang
AI Researcher & Systems Engineer
01

Question every assumption

They said you need float16 weights to build real language models.

FlashLM proved that three numbers — negative one, zero, and one — can achieve meaningful language modeling on a CPU. 27 GitHub stars, featured twice on Reddit r/LocalLLaMA. That assumption died with a 124M parameter model running on consumer hardware.

FlashLM
02

Find the architectural ceiling

They said transformers are the ceiling for sequence modeling — that generalizing to longer programs is architecturally impossible.

State-Flow Machine proved that explicit state tracking breaks through that ceiling by roughly 30x: at 4x the training length, SFM holds 62% accuracy while transformers collapse to ~2%. A 2.2M-parameter Transformer-Large performs no better than a 443K one — this is not a scale issue. It is an architectural wall.

State-Flow Machine
03

Build what they say cannot be built

A download manager that beats IDM? A Go AI trained from scratch? Private AI without cloud APIs?

Bolt DM hits 213+ MB/s — 4x faster than the industry standard. AscendGo trains neural networks from zero via self-play. NEXUS v2 runs 7 AI layers entirely on local Ascend NPUs, falling back to cloud only when absolutely necessary. Every one of these was "impossible" until it shipped.

Systems
04

The frontier is defined by what we question

Every project I build is an attempt to find the next assumption that is wrong.

Nano-Coder pushes knowledge distillation to its limits on Ascend 910ProA with 4-NPU pipeline parallelism. SFM Thinker-1.5B extends explicit state reasoning to full code synthesis with DeltaNet gates. FlashLM v8 "Nova" explores hybrid ternary-binary quantization. The frontier of intelligence is not defined by what we know is possible — it is defined by what we are willing to test, question, and ultimately destroy.

Research

The next generation of AI systems I build will continue to find where the current paradigm breaks — and build something that does not. This is not arrogance. This is the scientific method applied to engineering.

Development

How I build

My tools, models, and workflow for turning ideas into working systems.

Claude Code

Primary development tool — AI-powered terminal coding agent

🧠

Opus 4.7

Favorite model for complex reasoning, architecture design, and code generation

GLM-5.1

Go-to model for fast, high-quality development and problem solving

🔥

Huawei Ascend 910

NPU cluster for training AI models — the hardware that powers FlashLM, SFM, and Nano-Coder

📦

Git & GitHub

64 repositories, version control, open-source contributions

🪟

Windows 11

Primary development environment with WSL2, WPF/WinUI apps, and system-level programming

Workflow

1

Research problem deeply before writing code

2

Prototype with Claude Code (Opus 4.7 / GLM-5.1)

3

Train and iterate on Ascend 910 NPUs

4

Test rigorously, optimize performance

5

Ship and open-source when ready

Contact

Let's build something
extraordinary together

Interested in AI research collaboration, discussing architectures, contributing to open source, or just want to chat? I'm always open to interesting conversations.

What I'm open to

🔬

AI Research Collaboration

Novel architectures, post-transformer systems, ternary models, knowledge distillation

♟️

Game Engine Development

Chess, Go, or board game AI — neural network training, search algorithms, evaluation

🌐

Open Source Projects

Contributing to or building developer tools, AI infrastructure, and community projects

💬

Technical Discussions

System design, performance optimization, NPU/GPU computing, and architecture decisions

Currently available for collaboration
Response time: usually within 24h