FlashLM
CPU-native language models trained from scratch on free-tier hardware.
More than 30 experiments in training language models from scratch on TinyStories, run on free-tier 4-vCPU machines with no GPU and no pretrained weights. The project surfaces an uncomfortable finding: the training objective matters far more than the architecture, and a lower perplexity does not by itself produce coherent text.
Validation perplexity (PPL) on TinyStories (lower is better). Best PPL ≠ best coherence.
Key innovations
Weights are constrained to {-1, 0, +1}, which is log2(3) ≈ 1.58 bits per weight; experiments showed that this 1.58-bit quantization converges at small scale.
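Delta-rule memory M += β(v − M·k)⊗k writes only the error between the target value v and the memory's current prediction M·k, so it performs targeted correction rather than blind accumulation. It reaches 3.54× better PPL than the transformer baseline.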
An O(n) linear-attention cumulative sum (cumsum) replaces O(n²) attention and is about 15× cheaper per layer (136 µs vs 2,062 µs).
Product-Softmax addressing expands the memory to 512 virtual slots, with top-8 sparse reads and writes.
A 65K-parameter head predicts a bag of words 16 tokens ahead; this is the project's largest single gain (2.5×).
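The ternary weight constraint listed above can be illustrated with a minimal sketch. The source does not say which quantizer FlashLM uses, so this assumes per-tensor absmean scaling (the scheme popularized by BitNet b1.58); the function name `ternarize` and the `eps` parameter are hypothetical.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Quantize a float weight array to codes in {-1, 0, +1} plus one scale.

    Assumption: per-tensor absmean scaling; FlashLM's actual scheme may differ.
    """
    scale = np.mean(np.abs(w)) + eps            # one scale for the whole tensor
    w_q = np.clip(np.round(w / scale), -1, 1)   # round, then clamp to ternary
    return w_q.astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

w_q, scale = ternarize(w)
w_hat = w_q * scale                 # dequantized weights for the forward pass
bits_per_weight = np.log2(3)        # three states per weight, about 1.58 bits
```

During training, schemes of this kind usually keep full-precision latent weights and quantize on the fly in the forward pass; whether FlashLM does so is not stated in the source.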
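The delta-rule update M += β(v − M·k)⊗k from the list above can be contrasted with plain additive accumulation in a few lines. This is a sketch under assumed conventions: M has shape (d_v, d_k), the key is unit-norm, and the helper names `delta_rule_write` and `additive_write` are hypothetical.

```python
import numpy as np

def delta_rule_write(M, k, v, beta):
    """Delta rule: write only the error between v and the current readout M @ k."""
    err = v - M @ k
    return M + beta * np.outer(err, k)

def additive_write(M, k, v):
    """Blind accumulation: add the association regardless of what is stored."""
    return M + np.outer(v, k)

d_k, d_v = 4, 3
k = np.array([1.0, 0.0, 0.0, 0.0])   # unit-norm key (assumed)
v = np.array([0.5, -1.0, 2.0])       # value to associate with k

M_delta = np.zeros((d_v, d_k))
M_add = np.zeros((d_v, d_k))
for _ in range(3):                   # write the same (k, v) pair three times
    M_delta = delta_rule_write(M_delta, k, v, beta=1.0)
    M_add = additive_write(M_add, k, v)

read_delta = M_delta @ k             # stays at v: later writes see zero error
read_add = M_add @ k                 # grows to 3 * v: every write piles on
```

With β = 1 and a unit-norm key, the first write stores v exactly and later writes of the same pair change nothing, whereas the additive memory keeps growing.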
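The O(n) linear-attention cumsum mentioned above can be sketched as follows. The source does not specify the feature map or normalization, so this assumes the common elu(x) + 1 feature map with a running normalizer; it is an illustration of the general technique, not FlashLM's layer. A quadratic masked version is included only to check that both give the same result.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1 keeps features positive (assumed; the source names no map)
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_cumsum(Q, K, V):
    """Causal linear attention using prefix sums; cost grows linearly in n."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = np.cumsum(np.einsum("td,te->tde", Kf, V), axis=0)  # running sum of k ⊗ v
    z = np.cumsum(Kf, axis=0)                               # running sum of k
    num = np.einsum("td,tde->te", Qf, kv)
    den = np.einsum("td,td->t", Qf, z)[:, None]
    return num / den

def quadratic_reference(Q, K, V):
    """Same computation through an explicit n × n causal score matrix."""
    Qf, Kf = feature_map(Q), feature_map(K)
    scores = np.tril(Qf @ Kf.T)                             # mask out the future
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, e = 6, 4, 5
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, e))

out_linear = linear_attention_cumsum(Q, K, V)
out_quadratic = quadratic_reference(Q, K, V)
```

The cumsum form never materializes the n × n score matrix, which is where the per-layer saving comes from; a streaming implementation would carry the running sums as state instead of storing one per position.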
No FlashLM model achieves true narrative coherence yet: every model under 16.8M parameters breaks down about 100 tokens into generation. v5 "Thunderbolt" (PPL 1.36) is the only readable model, but it required 40 hours on hardware beyond the free tier. Every failure is documented honestly.