
Show HN: Running Gemma-4 26B at 124 tokens/sec on a CPU, no GPU

apeg.dev

I wanted to know how fast a 26B mixture-of-experts model could run on a desktop CPU with no GPU. I got ~40 tok/s single-stream (lossless) and ~124 tok/s batched. The surprising part was the byte budget: for this model, the part worth compressing is the output head (32% of per-token bytes), not the experts (16%). The writeup has the bandwidth roofline and the dead ends; the repo has the reproducible recipe. Happy to answer questions.

Repo: https://github.com/arun-prasath2005/gemma4-cpu-moe
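
For readers who want to see why the byte budget matters, here is a minimal sketch of the bandwidth-roofline arithmetic, not the author's code. It assumes decoding is purely memory-bound, so tokens/sec is bounded by memory bandwidth divided by bytes read per token. The 32% and 16% shares come from the post; the function names and the 2x compression ratio are hypothetical and only there for illustration.

```python
def tokens_per_second(bandwidth_bytes_per_s: float, bytes_per_token: float) -> float:
    """Roofline upper bound on decode speed, assuming memory-bound decoding."""
    return bandwidth_bytes_per_s / bytes_per_token


def speedup_from_compression(byte_fraction: float, compression_ratio: float) -> float:
    """Upper-bound speedup from shrinking one component of the per-token bytes.

    byte_fraction: share of per-token bytes taken by that component (0..1).
    compression_ratio: how many times smaller the component becomes.
    """
    # Bytes left over, as a fraction of the original per-token bytes.
    remaining = (1.0 - byte_fraction) + byte_fraction / compression_ratio
    return 1.0 / remaining


HEAD_FRACTION = 0.32    # output head share of per-token bytes (from the post)
EXPERT_FRACTION = 0.16  # expert share of per-token bytes (from the post)
RATIO = 2.0             # hypothetical compression ratio, not from the post

head_gain = speedup_from_compression(HEAD_FRACTION, RATIO)
expert_gain = speedup_from_compression(EXPERT_FRACTION, RATIO)

print(f"compress head    {RATIO:.0f}x: up to {head_gain:.2f}x faster")
print(f"compress experts {RATIO:.0f}x: up to {expert_gain:.2f}x faster")
```

Under these assumptions, the same compression ratio buys roughly twice the gain on the head as on the experts, simply because the head accounts for twice the bytes read per token. Real numbers would also depend on compute cost and cache behavior, which this sketch ignores.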

Comments

The output head byte budget is surprising. Did you try any tradeoff where the head is compressed more aggressively but experts stay mostly untouched?