A 10 year old Xeon is all you need

540 points · 234 comments on HN

Running a 26B-parameter language model at reading speed on a 2016 Xeon with 128GB DDR3 RAM and no GPU, using speculative decoding and CPU-specific optimizations.

A developer ran Gemma 4's 26B-parameter mixture-of-experts (MoE) model on recycled server hardware: an Intel Xeon E5-2620 v4 (8 cores, 2.1 GHz) with 128GB DDR3 RAM and no GPU. On this machine the bottleneck is memory bandwidth, not compute. The setup combines speculative decoding (a small drafter model proposes tokens that a larger verifier model checks), CPU-optimized MoE routing to minimize cache thrashing, Flash Attention ported to CPU, Multi-Head Latent Attention for KV cache compression, memory pinning, and runtime repacking of weight matrices. The 82GB footprint (25GB weights, 56GB KV cache at the full 262K context) runs at approximately 12-20 tokens per second during evaluation. Getting there required 25 command-line flags, most of them undocumented, which highlights a usability gap between black-box tools like Ollama and specialized inference engines like ik_llama.cpp.
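To make the drafter/verifier pairing concrete, here is a minimal sketch of greedy speculative decoding. It is not the article's implementation or ik_llama.cpp's code: `target_model` and `draft_model` are hypothetical toy functions over integer tokens, standing in for the large verifier and the small drafter. The point it illustrates is that each verifier pass can accept several drafted tokens at once, while the final output stays identical to what the verifier alone would have produced.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large, slow verifier model."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    """Hypothetical cheap drafter: agrees with the target except on every 4th position."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(tokens) < goal:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks the proposal. A real engine scores all k
        #    positions in one batched forward pass; this loop emulates that.
        verify_passes += 1
        accepted: List[int] = []
        for drafted in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if drafted != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the pass also yields a bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], verify_passes


if __name__ == "__main__":
    out, passes = speculative_decode(target_model, draft_model, [1, 2, 3], 20)
    assert out == greedy_decode(target_model, [1, 2, 3], 20)
    print(f"20 tokens generated with {passes} verifier passes")
```

Because every kept token is the verifier's own choice, the drafter affects only speed, never the result. On bandwidth-bound hardware that is the appeal: each pass over the large model's weights can yield several tokens instead of one.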

What the HN community is saying

Commenters identified a factual inconsistency: Intel's specifications for the Xeon E5-2620 v4 list DDR4-only support, not DDR3, contradicting the article's claim. The author acknowledged that this may reflect an OEM variant or a misidentified processor generation. On substance, readers questioned the practical utility: one commenter noted that eval-phase throughput of 12-20 tokens per second is too low to process meaningful amounts of text, compared with GPU baselines near 1000 tokens/s. The author clarified that the metric reports evaluation time under system load and suggested the machine remains useful for occasional tasks or retro computing. Commenters also debated environmental efficiency: the benefit of reusing old hardware versus the energy cost of running it for long periods.
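The throughput objection is easy to put in numbers. The sketch below uses the rates quoted above (12-20 tokens/s for the Xeon, roughly 1000 tokens/s for a GPU baseline); the 100,000-token workload is a hypothetical figure chosen only for illustration, not one taken from the discussion.

```python
def seconds_to_process(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to get through num_tokens at a steady rate."""
    return num_tokens / tokens_per_second


# Hypothetical workload size, chosen only to make the comparison concrete.
WORKLOAD_TOKENS = 100_000

# Rates as quoted in the discussion summary.
RATES = [
    ("Xeon, low end", 12),
    ("Xeon, high end", 20),
    ("GPU baseline", 1000),
]

if __name__ == "__main__":
    for label, rate in RATES:
        hours = seconds_to_process(WORKLOAD_TOKENS, rate) / 3600
        print(f"{label:>15}: {hours:6.2f} h")
```

At these rates the Xeon needs between one and two and a half hours for a job a GPU finishes in under two minutes, which fits the author's framing of the machine as suitable for occasional tasks rather than bulk processing.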