I build intelligent systems
& high-performance engines
Software engineer specializing in AI/ML research, game engines, and systems programming.
About
Driven by curiosity,
powered by code
I'm a student and software engineer with a deep passion for building systems that push boundaries. My work spans from low-level C++ chess engines and Go neural networks to cutting-edge ternary language models and kernel drivers.
I'm the creator of FlashLM (27 GitHub stars), a CPU-native ternary language model spanning 162 commits and 8 phases — proving Gated DeltaNet achieves 3.54x better PPL than transformers on identical data. My chess engine Luminex introduces novel Phased Move Generation with a fully self-engineered evaluation in ~8,050 lines of C++23. I train custom KataGo neural networks from scratch via self-play on Ascend NPUs.
I built WinPods, which brings the native AirPods experience to Windows through a custom KMDF kernel driver for noise control, along with TeachAssist for both desktop and Android, and open-source MCP tooling. I develop primarily with Claude Code, powered by my favorite models: Opus 4.7 and GLM-5.1.
At a glance
“Understand the system deeply, then build it to perform flawlessly.”
AI & Machine Learning
Ternary LLMs (FlashLM, 27★), custom KataGo neural networks trained via self-play on Ascend NPUs, post-transformer architectures (Gated DeltaNet), knowledge distillation
Game Engines
Luminex — C++23 chess engine with novel Phased Move Generation and fully self-engineered evaluation. ~8,050 LOC, zero dependencies, cross-platform
Desktop Applications
TeachAssist Desktop (C#/.NET 10), TeachAssist Android (Kotlin/Compose), WinPods (AirPods for Windows with custom KMDF kernel driver), WPF/WinUI 3
Tools & Infrastructure
MCP servers, AI-powered web platforms, Chrome extensions, LLM-friendly code browsing (RepoBeam), NPU-accelerated training pipelines
Every law in this field is an assumption waiting to be shattered.
You need float16 weights to build real language models? FlashLM proved that three numbers — negative one, zero, and one — can do the same on a CPU. That assumption died.
Transformers are the ceiling for sequence modeling? Gated DeltaNet proved targeted correction memory achieves 3.54x better PPL at identical scale. That assumption died too.
Projects
Selected work
A curated collection of projects spanning AI research, systems programming, game engines, and desktop applications.
FlashLM
CPU-native language models trained entirely from scratch — no GPUs, no pretraining. Exploring ternary quantization, Gated DeltaNet, and test-time search to push the limits of what small models can achieve on free-tier hardware.
162 commits across 8 development phases. 7 models on HuggingFace. Every experiment documented — including all failures.
Weights constrained to {-1, 0, +1}. Proved 1.58-bit quantization converges at small scale.
Delta rule memory (M += β(v - M·k)⊗k) performs targeted correction; a minimal sketch of the update appears below. 3.54x better PPL than transformer baseline.
GRU + VQ-VAE codebook for explicit entity state tracking across sentence boundaries. Latest innovation.
Best PPL on free-tier CPU: 2.33 with 6.6M params in 2h. Beat transformer baseline by 3.54x.
AlphaGo-inspired test-time compute. Value heads genuinely learned (V_Corr +0.66), but search doesn't fix coherence.
Every failure documented. Reckoning at PPL 130, RWKV at 377, Story Compass at 17.56. No cherry-picking.
Lean ternary attention architecture achieving 11k tok/s on CPU with pure BitLinear projections — 3x speedup over standard linear layers.
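A minimal NumPy sketch of the delta rule update referenced above (illustrative only; the variable names and shapes are assumptions, not FlashLM's actual code):

```python
import numpy as np

def delta_rule_update(M, k, v, beta):
    """One delta rule step: M += beta * outer(v - M @ k, k).

    M    : (d_v, d_k) memory matrix
    k    : (d_k,)     key (assumed unit-normalized)
    v    : (d_v,)     value to store
    beta : write strength in [0, 1]
    """
    err = v - M @ k                        # targeted correction, not blind accumulation
    return M + beta * np.outer(err, k)

# toy usage: write a value, then read it back with the same key
d_k = d_v = 8
M = np.zeros((d_v, d_k))
k = np.random.randn(d_k); k /= np.linalg.norm(k)
v = np.random.randn(d_v)
M = delta_rule_update(M, k, v, beta=1.0)
print(np.allclose(M @ k, v))               # True: the memory now recalls v for key k
```

With β = 1 and a unit-norm key, a single update makes the memory recall v exactly for k; smaller β blends the correction in gradually instead of overwriting.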
PPL Evolution
Luminex
A world-class UCI chess engine written in modern C++23. Features a novel Phased Move Generation optimization and a fully self-engineered hand-crafted evaluation — every PST value, mobility coefficient, and king safety parameter derived from chess first principles.
~8,050 lines of code. Zero external dependencies. Cross-platform: Linux, Windows, macOS (Apple Silicon native).
Novel optimization: generates moves in priority phases (TT → captures → quiets); a sketch follows below. ~70-80% of positions cut off before quiet generation is ever invoked.
1,402 lines of hand-crafted evaluation — no values borrowed from PeSTO, Ethereal, or any other engine. All weights derived from chess first principles.
PVS with LMR, null move pruning, singular extensions, ProbCut, razoring, quiescence search with SEE, and mate distance pruning. 2,265 lines of search logic.
5 pre-built binaries (Linux AVX2/SSE, Windows ClangCL/MSVC, macOS ARM64). CMake with auto-detected SIMD. ~8,050 LOC total, zero dependencies.
Search & Evaluation Arsenal
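To make the phased idea concrete, here is a simplified Python sketch of how priority-phase generation lets a beta cutoff skip quiet-move generation entirely (the real engine is C++23; the position object and every method on it are hypothetical stand-ins, not Luminex code):

```python
# Simplified sketch of Phased Move Generation inside a bare-bones negamax search.
def phased_moves(pos, tt_move):
    if tt_move and pos.is_legal(tt_move):
        yield tt_move                                    # phase 1: transposition-table move
    for mv in sorted(pos.captures(), key=pos.mvv_lva, reverse=True):
        yield mv                                         # phase 2: captures, MVV-LVA ordered
    for mv in pos.quiets():
        yield mv                                         # phase 3: quiets, generated only if reached

def search(pos, alpha, beta, depth, tt_move=None):
    if depth == 0:
        return pos.evaluate()
    for mv in phased_moves(pos, tt_move):
        score = -search(pos.make(mv), -beta, -alpha, depth - 1)
        if score >= beta:
            return beta                                  # fail-high: later phases never generated
        alpha = max(alpha, score)
    return alpha
```

Because the TT move and captures are tried first, most fail-high nodes never reach phase 3, which is where the reported 70-80% savings come from.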
AscendGo
Go AI with custom KataGo-compatible neural networks trained from scratch via self-play. 21 checkpoints for 19x19 across 3 architectural generations (18b/384ch, 28b/512ch, refined), plus a dedicated 9x9 pipeline with a released model at 8.5M self-play steps. Deployed via custom C++ GTP engine for OGS and CGOS competitive play.
TeachAssist Desktop
Polished Windows 11 desktop app for YRDSB students. Bento-grid dashboard, grade trend charts, What-If calculator, grade goals, CSV/HTML export, Ontario course code decoding, and auto-login via Windows Credential Manager. v5.0 with fully custom components.
TeachAssist Android
Material Design 3 Android companion app. Spring-animated Grade Ring, background grade-change notifications via WorkManager, biometric login, What-If calculator, confetti celebrations, AMOLED/Dynamic Color theming, and offline disk caching. v2.9.0 targeting SDK 35.
WinPods
Brings the native AirPods experience to Windows. iOS-style translucent battery popup, auto-connect on case open, media controls, ear detection, and full noise control (ANC/Transparency/Adaptive) via a custom KMDF kernel driver for L2CAP access. Supports all AirPods and Beats with W1/H1/H2 chips.
free_web_tools
Comprehensive MCP server with 14 tools for web search, deep research, GitHub integration, code search, and package lookups. Multi-backend search (DDG + Mojeek + Bing + Startpage), content extraction, Wikipedia, and auto-answer synthesis. Zero API keys needed. v5.0.
Under the Hood
Code that speaks
Real code from real projects. Every snippet is production code that shipped — not pseudocode, not examples.
The core insight: replace float16 weights with {-1, 0, +1} — enabling CPU-native inference at extreme speed.
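The shipped snippets live in the project repos; as a stand-in here, this is a minimal illustrative sketch of that insight, using absmean ternary quantization in the style of BitNet b1.58 (names and details are assumptions, not FlashLM's production code):

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    """Snap a float weight matrix to {-1, 0, +1} with a single absmean scale."""
    scale = np.abs(W).mean() + eps                # per-tensor absmean scale
    Wq = np.clip(np.round(W / scale), -1, 1)      # every weight becomes -1, 0, or +1
    return Wq.astype(np.int8), scale

def bitlinear_matmul(x, Wq, scale):
    """Matmul against ternary weights: adds/subtracts only, one float rescale."""
    return (x @ Wq.T.astype(x.dtype)) * scale

W = np.random.randn(256, 256).astype(np.float32)
x = np.random.randn(1, 256).astype(np.float32)
Wq, s = ternary_quantize(W)
y = bitlinear_matmul(x, Wq, s)                    # approximates x @ W.T with 1.58-bit weights
```

Because every weight is -1, 0, or +1, the matrix product reduces to additions and subtractions plus one rescale, which is what makes CPU-native inference fast.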
Skills
What I work with
Languages, frameworks, and platforms I use across my projects, from low-level systems to large language models.
Languages
AI / ML
Frameworks
Specializations
Technology Map
Tech Constellation
Now
What I'm working on
Active projects and research directions — pushing boundaries across AI, game engines, and autonomous agents.
FlashLM v10
Developing BitLinear attention architecture for CPU-native ternary LMs. v10 achieves 11k tok/s with pure BitLinear projections.
Stripped to lean BitLinear attention with d=256, L=4, H=4, ~3.9M params. All projections ternary. Standard causal attention replacing Gated DeltaNet for simplicity and speed. Training on TinyStories V2-GPT4 full train split.
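A rough NumPy sketch at the stated shape (d_model=256, 4 heads), showing the standard causal attention that replaces Gated DeltaNet in v10; in the real model the four projection matrices would be BitLinear ternary rather than float, and everything below is illustrative rather than the actual training code:

```python
import numpy as np

# Single-layer causal self-attention at the stated v10 shape (d_model=256, 4 heads).
# In FlashLM v10 the projections would be ternary {-1, 0, +1}; plain float weights
# are used here purely for readability.
d_model, n_heads = 256, 4
d_head = d_model // n_heads
rng = np.random.default_rng(0)

def proj():
    return rng.standard_normal((d_model, d_model)).astype(np.float32) / np.sqrt(d_model)

Wq, Wk, Wv, Wo = proj(), proj(), proj(), proj()

def causal_attention(x):
    """x: (T, d_model) -> (T, d_model); softmax attention with a causal mask."""
    T = x.shape[0]
    q = (x @ Wq).reshape(T, n_heads, d_head)
    k = (x @ Wk).reshape(T, n_heads, d_head)
    v = (x @ Wv).reshape(T, n_heads, d_head)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d_head)        # (heads, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)                  # hide future positions
    scores = np.where(mask, -np.inf, scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.einsum("hts,shd->thd", probs, v).reshape(T, d_model)
    return out @ Wo

x = rng.standard_normal((16, d_model)).astype(np.float32)
print(causal_attention(x).shape)    # (16, 256)
```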
AscendGo
Bug fixing, benchmarking visits-per-second search performance, and testing on OGS for real-world competitive play with custom KataGo neural networks.
21 trained checkpoints for 19x19 across 3 architectural generations. Dedicated 9x9 pipeline with released model at 8.5M self-play steps. Fixing critical search and evaluation bugs, benchmarking visits-per-second throughput, and preparing for OGS integration via the GTP protocol.
Nano-Coder
Knowledge distillation pipeline for 8B coding agent on Ascend 910ProA with 4-NPU pipeline parallelism.
Building NC-1 Preview with pipeline validation on OpenI. Architecture based on SFM (State-Flow Machine) with delta rule memory for code execution reasoning. Runtime FlashAttention patch for unsupported CANN kernels. Targeting best 8B coding agent.
Future Vision
FlashLM v10 — bridging the gap between speed and coherence at 4M params. Pure ternary BitLinear attention hitting 11k tok/s on CPU, pushing toward coherent generation at a fraction of conventional model sizes.
Nano-Coder — state-based reasoning for code intelligence. An 8B coding agent built on SFM with delta rule memory, distilling code execution reasoning through 4-NPU pipeline parallelism on Ascend hardware.
AscendGo — competitive Go AI on OGS. 21 trained neural network checkpoints for 19x19 and a released 9x9 model at 8.5M self-play steps, all trained on Ascend NPUs. The engine is approaching readiness for live OGS play against human opponents.
Recognition
Achievements
Featured releases, milestones, and community recognition across AI research, systems programming, and open source.
FlashLM Featured on Reddit r/LocalLLaMA
Featured twice on one of the largest local AI communities. Thousands of views and discussions about ternary language models running on CPU hardware.
Nano-Coder: 4-NPU Pipeline Parallelism
Knowledge distillation pipeline for 8B coding agent on Ascend 910ProA with runtime FlashAttention patch for unsupported CANN kernels. Delta rule memory for code execution reasoning.
Bolt DM: 213+ MB/s Download Speed
Built a C++23 download accelerator that peaks at 4x faster than Internet Download Manager. Dynamic segmentation with work stealing and stalled recovery.
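A conceptual Python sketch of dynamic segmentation with a shared worker pool (the real Bolt DM is C++23 and issues HTTP range requests; the scheme, names, and segment counts below are illustrative assumptions, not its implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def split_segments(total_size, n_segments):
    """Split [0, total_size) into contiguous byte ranges."""
    step = -(-total_size // n_segments)               # ceil division so nothing is dropped
    return [(s, min(s + step, total_size)) for s in range(0, total_size, step)]

def download(total_size, fetch_range, n_workers=8, n_segments=32):
    # Many small segments + a worker pool: fast connections keep pulling fresh
    # segments while a stalled one only blocks its current chunk, which can be
    # re-queued on failure (stalled recovery).
    segments = split_segments(total_size, n_segments)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        chunks = list(pool.map(lambda seg: fetch_range(*seg), segments))
    return b"".join(chunks)

# toy usage with a fake in-memory "server"
data = bytes(range(256)) * 1000
result = download(len(data), lambda start, end: data[start:end])
assert result == data
```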
FlashLM 27 GitHub Stars
Open-source ternary language model repository earned 27 stars from the AI/ML community. Proof that the ternary paradigm resonates with researchers.
Two Chess Engines Shipped
Luminex (C++23, ~8,050 LOC) and Douchess (C++17) — both using bitboard architecture with superhuman search depth. Luminex features novel Phased Move Generation and self-engineered evaluation.
4 Chrome Extensions Published
DouGrammar, Doucite, Doulet AI Assistant, and Panic Button — shipped and used by real users for grammar checking, citations, AI answers, and tab management.
DouEssay: 98-99% Teacher Alignment
AI Writing Mentor that grades essays using Ontario curriculum rubrics with near-perfect alignment to teacher scoring. Real-time feedback and visual analytics.
NEXUS v2: 7-Layer Private AI
Built a private AI assistant with Soul, Observer, Encoder, Memory, Resonator, Decoder, and Agent layers. Self-evolves through daily training on Ascend 910 NPUs.
Go Neural Networks Trained from Scratch
Trained custom KataGo-compatible neural network weights (KW/KW9x9) from scratch via self-play on Huawei Ascend NPUs. 21 model checkpoints for 19x19.
“The best projects are the ones people tell you not to build. They said ternary models cannot work. They said transformers are the ceiling. They said a student cannot build engines that compete with decades-old projects. I built them anyway.”
Philosophy
No law is unbreakable
Every law in this field is an assumption waiting to be shattered. Here are four assumptions that I found, tested, and destroyed — each one backed by a shipped project.
“Every law in this field is an assumption waiting to be shattered. The question is never whether it can be broken — it is whether you have the audacity to try.”
Question every assumption
They said you need float16 weights to build real language models.
FlashLM proved that three numbers — negative one, zero, and one — can achieve meaningful language modeling on a CPU. 27 GitHub stars, featured twice on Reddit r/LocalLLaMA. That assumption died with a 124M parameter model running on consumer hardware.
Find the architectural ceiling
They said transformers are the ceiling for sequence modeling — that generalizing to longer programs is architecturally impossible.
State-Flow Machine proved that explicit state tracking breaks through by roughly 30x: at 4x the training length, SFM achieves 62% accuracy while transformers collapse to ~2%. A 2.2M-parameter Transformer-Large performs no better than a 443K one — this is not a scale issue. It is an architectural wall.
Build what they say cannot be built
A download manager that beats IDM? A Go AI trained from scratch? Private AI without cloud APIs?
Bolt DM hits 213+ MB/s — 4x faster than the industry standard. AscendGo trains neural networks from zero via self-play. NEXUS v2 runs 7 AI layers entirely on local Ascend NPUs, falling back to cloud only when absolutely necessary. Every one of these was "impossible" until it shipped.
The frontier is defined by what we question
Every project I build is an attempt to find the next assumption that is wrong.
Nano-Coder pushes knowledge distillation to its limits on Ascend 910ProA with 4-NPU pipeline parallelism. SFM Thinker-1.5B extends explicit state reasoning to full code synthesis with DeltaNet gates. FlashLM v8 "Nova" explores hybrid ternary-binary quantization. The frontier of intelligence is not defined by what we know is possible — it is defined by what we are willing to test, question, and ultimately destroy.
The next generation of AI systems I build will continue to find where the current paradigm breaks — and build something that does not. This is not arrogance. This is the scientific method applied to engineering.
Development
How I build
My tools, models, and workflow for turning ideas into working systems.
Claude Code
Primary development tool — AI-powered terminal coding agent
Opus 4.7
Favorite model for complex reasoning, architecture design, and code generation
GLM-5.1
Go-to model for fast, high-quality development and problem solving
Huawei Ascend 910
NPU cluster for training AI models — the hardware that powers FlashLM, SFM, and Nano-Coder
Git & GitHub
64 repositories, version control, open-source contributions
Windows 11
Primary development environment with WSL2, WPF/WinUI apps, and system-level programming
Workflow
Research problem deeply before writing code
Prototype with Claude Code (Opus 4.7 / GLM-5.1)
Train and iterate on Ascend 910 NPUs
Test rigorously, optimize performance
Ship and open-source when ready
Contact
Let's build something
extraordinary together
Interested in AI research collaboration, discussing architectures, contributing to open source, or just want to chat? I'm always open to interesting conversations.
What I'm open to
AI Research Collaboration
Novel architectures, post-transformer systems, ternary models, knowledge distillation
Game Engine Development
Chess, Go, or board game AI — neural network training, search algorithms, evaluation
Open Source Projects
Contributing to or building developer tools, AI infrastructure, and community projects
Technical Discussions
System design, performance optimization, NPU/GPU computing, and architecture decisions