Software engineer specializing in AI/ML research, game engines, and systems programming.
About
I'm a student and software engineer with a deep passion for building systems that push boundaries. My work spans from low-level C++ chess engines and Go neural networks to cutting-edge ternary language models and private AI assistants.
I'm the creator of FlashLM (24 GitHub stars), a CPU-native ternary language model that proved ternary weights can match float16 performance. I built the State-Flow Machine, a novel post-transformer architecture that generalizes 30x better than standard transformers on long programs. My chess engines Luminex and Douchess search at superhuman depth on bitboard-based move generation and evaluation.
I train custom Go networks on KataGo from scratch, built a download manager that hits 213+ MB/s, created an essay grader with 98-99% teacher alignment, and built 4 Chrome extensions used by real users. I develop primarily with Claude Code, powered by my favorite models — Opus 4.6 and GLM 5 Turbo.
“Understand the system deeply, then build it to perform flawlessly.”
Ternary LLMs (FlashLM, 24★), knowledge distillation, neural network training on Huawei Ascend NPUs, multi-agent systems, post-transformer architectures
Two chess engines (Luminex in C++23, Douchess in C++17) with bitboard architecture, Go AI with custom KataGo neural networks trained via self-play
Bolt Download Manager (213+ MB/s peak, 4x faster than IDM), parallel computing, NPU acceleration, C# desktop apps with Fluent Design
Next.js, TypeScript, Python, 4 Chrome extensions (DouGrammar, Doucite, Doulet AI, Panic Button), AI-powered web platforms, MCP servers
Every law in this field is an assumption waiting to be shattered.
You need float16 weights to build real language models? FlashLM proved that three numbers — negative one, zero, and one — can do the same on a CPU. That assumption died.
Transformers are the ceiling for sequence modeling? State-Flow Machine proved explicit state tracking breaks through by 30x. That assumption died too.
Projects
A curated collection of projects spanning AI research, systems programming, game engines, and web development.
A novel post-transformer architecture for code intelligence that replaces the single-transformer paradigm with 4 specialized systems. The core insight: coding is about state transformations — what a program does vs what it should do — and explicit state tracking generalizes to longer programs in ways that implicit token-level models provably cannot (TC0 circuit complexity limit, Siems et al. ICLR 2025).
After every two perception layers, all four systems synchronize by projecting into a shared 256-dimensional space with learned gates. The final output is a learned weighted combination of the four system outputs (initialized at 25% each).
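The gated fusion step can be sketched in a few lines of NumPy. This is an illustrative sketch, not SFM's actual code: the per-system widths, the sigmoid gates, and the random initialization are assumptions; only the shared 256-d space, learned gates, and 25%-each mixing weights come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
SHARED_DIM = 256  # shared synchronization space from the description

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Four system outputs with hypothetical native widths.
system_dims = [128, 256, 192, 64]
outputs = [rng.standard_normal(d) for d in system_dims]

# Learned projections into the shared 256-d space (randomly initialized here).
projections = [rng.standard_normal((SHARED_DIM, d)) * 0.02 for d in system_dims]

# Learned per-system gate logits; zeros give sigmoid(0) = 0.5 at init.
gate_logits = np.zeros(len(outputs))

# Mixing weights initialized at 25% each, as described.
mix_weights = np.full(len(outputs), 0.25)

projected = [P @ h for P, h in zip(projections, outputs)]
gated = [sigmoid(g) * z for g, z in zip(gate_logits, projected)]
fused = sum(w * z for w, z in zip(mix_weights, gated))
```

In a real model the projections, gates, and mixing weights would all be trained parameters; here they only show the data flow.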
Task: Predict final value of target variable after arithmetic operations. Trained on 10-27 ops, evaluated at up to 32x length.
| Length | SFM (State Slots) | Transformer-Fair | Transformer-Large |
|---|---|---|---|
| 1x | 99.9% | 100.0% | 100.0% |
| 2x | 92.9% | 99.0% | 99.5% |
| 4x | 62.0% | 1.9% | 3.1% |
| 8x | 35.3% | 1.3% | 1.0% |
| 16x | 5.1% | 0.9% | 0.7% |
| 32x | 5.0% | 1.0% | 0.8% |
SFM retains 62% at 4x length while transformers collapse to ~2% — a 30x generalization gap. The 2.2M Transformer-Large performs no better than 443K Transformer-Fair: this is an architectural limitation, not a scale issue.
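The benchmark task is easy to picture with a toy generator. This sketch is mine, not the actual experiment harness: the variable count, op set (`+=`/`-=` only), and value ranges are guesses, but the shape of the task, a straight-line program whose final target value must be predicted, matches the description.

```python
import random

def make_program(n_ops, n_vars=4, seed=None):
    """Generate a toy straight-line arithmetic program and its answer.

    Variable count, op set, and value ranges are illustrative guesses,
    not the actual SFM benchmark spec.
    """
    rng = random.Random(seed)
    names = [f"v{i}" for i in range(n_vars)]
    env = {n: rng.randint(0, 9) for n in names}
    lines = [f"{n} = {env[n]}" for n in names]
    for _ in range(n_ops):
        dst = rng.choice(names)
        src = rng.choice(names)
        op = rng.choice(["+", "-"])
        lines.append(f"{dst} {op}= {src}")
        env[dst] = env[dst] + env[src] if op == "+" else env[dst] - env[src]
    target = rng.choice(names)
    lines.append(f"print({target})")
    return "\n".join(lines), env[target]

prog, answer = make_program(n_ops=10, seed=42)
```

Training on 10-27 ops and evaluating at 32x length then just means calling `make_program` with much larger `n_ops` at test time.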
Fine-tuning Qwen2.5-Coder-1.5B with DeltaNet SFM blocks for code execution reasoning. Simple delta rule: S ← S − β(Sk − v)kᵀ, 16 heads × 16×16 state, inserted after layers 6, 13, 20, 27.
Multi-loss: masked CE + 0.1×judge_BCE + 0.01×surprise_MSE. Self-evolution via EWMA difficulty adaptation. Synthetic exec() traces + debugging samples generated on-the-fly. Trained on 4× Ascend 910 with MindSpore 2.2 + CANN 7.
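The delta rule above is compact enough to sketch directly. A minimal NumPy version for one head, assuming a unit-norm key (the normalization and the β value here are assumptions, not the project's exact hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # per-head state is 16x16, as described

S = rng.standard_normal((D, D))   # associative state matrix
k = rng.standard_normal(D)
k = k / np.linalg.norm(k)         # unit-norm key (assumption)
v = rng.standard_normal(D)
beta = 0.5                        # write strength

# Delta rule: S <- S - beta * (S k - v) k^T
S_new = S - beta * np.outer(S @ k - v, k)
```

With a unit-norm key the update has a clean interpretation: reading back along k gives S_new @ k = (1 − β)(S @ k) + βv, i.e. the stored association for k is interpolated toward v by β.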
CPU-Native Ternary Language Models — proving that {-1, 0, +1} weights can match float16 performance. v7 "Eclipse" is a 124M parameter BitNet b1.58 transformer trained on FineWeb-Edu, featuring ARM NEON/OpenMP kernels and Ascend NPU acceleration. Featured on Reddit r/LocalLLaMA.
World-class classical chess engine in C++23. Features LMR, null move pruning, singular extension, aspiration windows, SEE-based quiescence search, and comprehensive evaluation with PST, mobility, pawn structure, and king safety. ~7000 LOC, ~280KB binary.
7-layer private AI assistant: Soul (immutable constitution), Observer (Windows UI automation via pywinauto), Encoder (Transformer), Memory (FAISS + SQLite), Resonator (retrieval-reasoning), Decoder (GRU), Agent (trust levels). Self-evolves through daily training on Ascend 910 NPUs.
Building the world's best 8B coding agent through knowledge distillation. Fine-tunes Qwen3-8B with LoRA on coding trajectories (SWE-bench, CoderForge) using 4-NPU pipeline parallelism on Ascend 910ProA, with custom FlashAttention patch.
High-performance C++17 chess engine using 64-bit bitboard architecture. Robust search algorithm with deeply tuned handcrafted evaluation. Represents the "Classical" development era with 16M-node transposition table and 0-950ms variable search time.
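The bitboard idea behind both engines can be illustrated with the classic set-wise knight-attack computation (a1 = bit 0, h8 = bit 63). This is the standard textbook technique, not Douchess's actual C++ code, ported to Python integers for readability:

```python
MASK64 = (1 << 64) - 1
FILE_A = 0x0101010101010101
FILE_H = 0x8080808080808080
FILE_AB = FILE_A | (FILE_A << 1)   # files a and b
FILE_GH = FILE_H | (FILE_H >> 1)   # files g and h

def knight_attacks(bb):
    """Attack set for all knights in bitboard `bb`, one bit per square."""
    l1 = (bb >> 1) & ~FILE_H & MASK64   # one file west (mask off h-file wraps)
    l2 = (bb >> 2) & ~FILE_GH & MASK64  # two files west
    r1 = (bb << 1) & ~FILE_A & MASK64   # one file east (mask off a-file wraps)
    r2 = (bb << 2) & ~FILE_AB & MASK64  # two files east
    h1 = l1 | r1                        # +-1 file, combine with +-2 ranks below
    h2 = l2 | r2                        # +-2 files, combine with +-1 rank below
    return ((h1 << 16) | (h1 >> 16) | (h2 << 8) | (h2 >> 8)) & MASK64
```

One pass computes attacks for every knight of a color at once; that set-wise parallelism is what makes bitboards fast for move generation and evaluation.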
GTP engine for the game of Go with self-play learning. Custom KataGo-compatible neural network weights (KW/KW9x9) trained from scratch with extensive self-play, achieving competitive performance in both 9x9 and 19x19. 21 model checkpoints for 19x19.
High-performance download accelerator in C++23 with dynamic segmentation (16-32 segments), work stealing, and stalled segment recovery. HTTP/2 support, Windows async I/O. Peaks at 213+ MB/s — 4x faster than IDM. Qt6 GUI + CLI + browser extension.
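The initial segment split is the simple part of dynamic segmentation and can be sketched directly. This only computes the starting HTTP byte ranges; the runtime resizing, work stealing, and stalled-segment recovery described above are not modeled here, and the function name is mine, not Bolt's API:

```python
def segment_ranges(total_size, n_segments):
    """Split a download of total_size bytes into contiguous, inclusive
    (start, end) byte ranges suitable for HTTP Range requests."""
    base, extra = divmod(total_size, n_segments)
    ranges, start = [], 0
    for i in range(n_segments):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

Each range then becomes one concurrent connection; a work-stealing scheduler would later re-split whatever range is furthest from completion.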
LLM-friendly code browsing platform. Indexes GitHub repos, extracts symbols using LLM (Qwen2.5-Coder-32B via NVIDIA NIM), and serves code in token-efficient chunks with symbol navigation and code search. Designed for AI agents.
Modern Windows app for YRDSB students. Fluent Design UI with Mica backdrop, grade trends visualization (ScottPlot), What-If calculator, grade goals, CSV/HTML export, course code decoding, and school name extraction. Built with WPF and .NET 10.
Brings the native AirPods experience to Windows with iOS 26 Liquid Glass UI. BLE battery monitoring, auto-connect on case open, media controls (play/pause from system tray), low battery alerts, and system tray integration.
Advanced grammar checking Chrome extension with real-time checking, spelling correction, style suggestions, and readability analysis. Supports 15+ languages and multiple AI providers (DeepSeek, OpenAI, Google, Anthropic, Qwen).
One-click citation generator Chrome extension. Layered metadata extraction (citation/DC meta, JSON-LD, OG tags, visible text, regex) with APA 7, MLA 9, Chicago formatting and BibTeX export. Special handling for government/academic sites.
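The layering can be sketched as a priority-ordered fallback chain. The dict keys and function name here are illustrative stand-ins, not Doucite's actual internals; only the source ordering (citation meta, then JSON-LD, then Open Graph, then visible text) follows the description:

```python
def extract_title(page):
    """Try metadata sources in priority order, returning the first hit.

    `page` is a plain dict standing in for a parsed document.
    """
    extractors = [
        lambda p: p.get("citation_meta", {}).get("citation_title"),  # <meta name="citation_title">
        lambda p: p.get("json_ld", {}).get("headline"),              # JSON-LD
        lambda p: p.get("og", {}).get("og:title"),                   # Open Graph
        lambda p: p.get("visible_title"),                            # visible-text fallback
    ]
    for extract in extractors:
        value = extract(page)
        if value:
            return value
    return None
```

The same chain pattern repeats per field (authors, date, publisher), which is why higher-quality structured sources like citation meta tags are tried first.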
Browser extension providing AI-powered answers on any webpage. Highlight text and get comprehensive educational responses. Uses NVIDIA NIM as primary API with OpenRouter fallback. 10+ free models, custom prompts, multi-language.
Next.js application platform using NVIDIA NIM APIs for AI-powered capabilities. Modern web stack with TypeScript and ESLint integration.
Zero-cost MCP server providing web search and content extraction. 4 tools: web_search, fetch_url, news_search, related_searches. Uses SearXNG public instances and Jina AI Reader. No API keys needed.
Browser extension that instantly hides all tabs except saved "important" ones. Save important websites, close all others with "PANIC" mode, keyboard shortcuts (Ctrl+Shift+X), context menu integration.
Under the Hood
Real code from real projects. Every snippet is production code that shipped — not pseudocode, not examples.
The core insight: replace float16 weights with {-1, 0, +1} — enabling CPU-native inference at extreme speed.
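As an illustrative sketch (not a FlashLM production snippet), the BitNet b1.58-style absmean recipe behind that insight looks roughly like this; the epsilon and the per-tensor (rather than per-channel) scale are assumptions:

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Absmean quantization: float weights -> {-1, 0, +1} plus one scale."""
    gamma = np.abs(W).mean()                        # per-tensor scale
    Wq = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return Wq.astype(np.int8), gamma

def ternary_matmul(x, Wq, gamma):
    """Inference-time matmul: integer adds/subtracts, one float rescale."""
    return gamma * (x @ Wq.astype(x.dtype).T)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)) * 0.1
Wq, gamma = ternarize(W)
```

Because every weight is -1, 0, or +1, the matmul needs no multiplications at all, only additions and subtractions, which is what makes CPU-native inference fast.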
Philosophy
Every law in this field is an assumption waiting to be shattered. Here are four assumptions that I found, tested, and destroyed — each one backed by a shipped project.
“Every law in this field is an assumption waiting to be shattered. The question is never whether it can be broken — it is whether you have the audacity to try.”
They said you need float16 weights to build real language models.
FlashLM proved that three numbers — negative one, zero, and one — can achieve meaningful language modeling on a CPU. 24 GitHub stars, featured twice on Reddit r/LocalLLaMA. That assumption died with a 124M parameter model running on consumer hardware.
They said transformers are the ceiling for sequence modeling — longer programs are architecturally impossible to generalize.
State-Flow Machine proved that explicit state tracking breaks through by 30x. At 4x training length: SFM achieves 62% while transformers collapse to ~2%. A 2.2M Transformer-Large performs no better than a 443K one — this is not a scale issue. It is an architectural wall.
A download manager that beats IDM? A Go AI trained from scratch? Private AI without cloud APIs?
Bolt DM hits 213+ MB/s — 4x faster than the industry standard. AscendGo trains neural networks from zero via self-play. NEXUS v2 runs 7 AI layers entirely on local Ascend NPUs, falling back to cloud only when absolutely necessary. Every one of these was "impossible" until it shipped.
Every project I build is an attempt to find the next assumption that is wrong.
Nano-Coder pushes knowledge distillation to its limits on Ascend 910ProA with 4-NPU pipeline parallelism. SFM Thinker-1.5B extends explicit state reasoning to full code synthesis with DeltaNet gates. FlashLM v8 "Nova" explores hybrid ternary-binary quantization. The frontier of intelligence is not defined by what we know is possible — it is defined by what we are willing to test, question, and ultimately destroy.
The next generation of AI systems I build will continue to find where the current paradigm breaks — and build something that does not. This is not arrogance. This is the scientific method applied to engineering.
Recognition
Featured releases, milestones, and community recognition across AI research, systems programming, and open source.
Featured twice in one of the largest local AI communities. Thousands of views and discussions about ternary language models running on CPU hardware.
State-Flow Machine achieved 62% accuracy at 4x length vs ~2% for transformers. Published experiment results confirming architectural advantage over scale.
Built a C++23 download accelerator that peaks at 4x faster than Internet Download Manager. Dynamic segmentation with work stealing and stalled recovery.
Open-source ternary language model repository earned 24 stars from the AI/ML community. Proof that the ternary paradigm resonates with researchers.
Luminex (C++23, ~7000 LOC) and Douchess (C++17) — both using bitboard architecture with superhuman search depth. Luminex features LMR, null move pruning, singular extension.
DouGrammar, Doucite, Doulet AI Assistant, and Panic Button — shipped and used by real users for grammar checking, citations, AI answers, and tab management.
AI Writing Mentor that grades essays using Ontario curriculum rubrics with near-perfect alignment to teacher scoring. Real-time feedback and visual analytics.
Built a private AI assistant with Soul, Observer, Encoder, Memory, Resonator, Decoder, and Agent layers. Self-evolves through daily training on Ascend 910 NPUs.
Trained custom KataGo-compatible neural network weights (KW/KW9x9) from scratch via self-play on Huawei Ascend NPUs. 21 model checkpoints for 19x19.
“The best projects are the ones people tell you not to build. They said ternary models cannot work. They said transformers are the ceiling. They said a student cannot build engines that compete with decades-old projects. I built them anyway.”
Now
Active projects and research directions — pushing boundaries across AI, game engines, and autonomous agents.
Fine-tuning Qwen2.5-Coder-1.5B with DeltaNet SFM blocks for code execution reasoning — the second major experiment validating the State-Flow Machine architecture.
Training on 4x Ascend 910 NPUs with the full SFM architecture. Multi-loss: masked CE + 0.1x judge_BCE + 0.01x surprise_MSE. Self-evolution via EWMA difficulty adaptation. Synthetic exec() traces and debugging samples generated on-the-fly.
Bug fixing, benchmarking visits-per-second search performance, and testing on OGS (Online Go Server) for real-world competitive play validation.
Fixing critical search and evaluation bugs found during self-play testing. Benchmarking how many visits per second the search can achieve on the machine to measure raw performance. Preparing for OGS integration via GTP protocol for live testing against human players.
Making the private AI assistant better — improving layer architecture, response quality, and getting closer to OpenClaw-level autonomous agent capabilities.
Refining the 7-layer architecture (Soul, Observer, Encoder, Memory, Resonator, Decoder, Agent) for better tool use and reasoning. Implementing OpenClaw-inspired autonomous action patterns — multi-step task execution, file system navigation, and self-healing error recovery. Daily self-evolution training on Ascend NPUs.
The next frontier: SFMs that reason — extending the State-Flow Machine from arithmetic to full code synthesis, debugging, and self-improvement. The 4-system architecture is not just a research artifact; it is a blueprint for how AI should think.
Ternary models at scale — proving that FlashLM's ternary weight paradigm extends beyond 124M. If a model can think in three states, it can think in any state. The question was never "can ternary work?" — it was "how far can we push it?"
Autonomous agents on consumer hardware — NEXUS approaching OpenClaw-level capabilities with full Ascend NPU acceleration, self-evolving daily, requiring zero cloud APIs. Private AI that runs entirely on your desk.
Open Source
Every project is open-source and publicly available. Research, code, and training configurations — nothing hidden behind paywalls.
Every major project ships on GitHub with MIT or permissive licensing. Research, code, and training scripts are public.
FlashLM went through 8 versions. Douchess v1 evolved into Luminex. Each release is better than the last because of rapid iteration.
Comprehensive READMEs with architecture explanations, experiment results, and reproducible training configurations.
Community feedback from Reddit, GitHub issues, and discussions directly shapes project direction and priorities.
CPU-native ternary language models — proving {-1, 0, +1} weights can match float16
High-performance C++17 chess engine with bitboard architecture
Private AI assistant with 7-layer architecture and self-evolution
World-class classical chess engine in C++23 with advanced search
Go AI with custom KataGo neural networks trained via self-play
High-performance download accelerator — 213+ MB/s peak speed
State-Flow Machine — post-transformer architecture for code intelligence
8B coding agent via knowledge distillation on Ascend NPUs
GitHub
Skills
Languages, frameworks, and platforms I work with across my projects.
20 skills across 4 domains
Development
My tools, models, and workflow for turning ideas into working systems.
Primary development tool — AI-powered terminal coding agent
Favorite model for complex reasoning, architecture design, and code generation
Go-to model for fast, high-quality development and problem solving
NPU cluster for training AI models — the hardware that powers FlashLM, SFM, and Nano-Coder
64 repositories, version control, open-source contributions
Primary development environment with WSL2, WPF/WinUI apps, and system-level programming
Research problem deeply before writing code
Prototype with Claude Code (Opus 4.6 / GLM 5 Turbo)
Train and iterate on Ascend 910 NPUs
Test rigorously, optimize performance
Ship and open-source when ready
Contact
Interested in AI research collaboration, discussing architectures, contributing to open source, or just want to chat? I'm always open to interesting conversations.
Novel architectures, post-transformer systems, ternary models, knowledge distillation
Chess, Go, or board game AI — neural network training, search algorithms, evaluation
Contributing to or building developer tools, AI infrastructure, and community projects
System design, performance optimization, NPU/GPU computing, and architecture decisions