Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

366 points · 110 comments on HN

Google releases Gemma 4 models optimized with Quantization-Aware Training to run efficiently on mobile and consumer hardware.

Google released QAT checkpoints for Gemma 4 models, designed to reduce memory footprint while preserving model quality. The E2B model requires under 1 GB in its text-only form when using a custom mobile quantization schema. Key optimizations include static activations pre-calculated during training, channel-wise quantization for mobile accelerators, targeted 2-bit quantization on token generation layers, and KV cache optimization. Standard Q4_0 quantization variants are also provided. The models are available on Hugging Face in GGUF and compressed tensor formats, with integration support for llama.cpp, Ollama, LM Studio, vLLM, SGLang, and other developer tools. Google also provides the LiteRT-LM runtime for edge deployment and Transformers.js for web deployment.
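To make the channel-wise quantization point concrete, here is a minimal sketch of why a per-channel scale can lose less information than a single per-tensor scale at 4 bits. It is illustrative only and is not Google's mobile schema or the Q4_0 format: the function names, the symmetric int4 range, and the toy weights are all assumptions made for this example.

```python
# Illustrative sketch: symmetric 4-bit quantization, per-tensor vs per-channel.
# Not Google's schema or Q4_0; every name and value here is hypothetical.

def scale_for(values, qmax=7):
    """Symmetric scale: the largest magnitude maps to the top int4 level."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0

def quantize_row(row, scale, qmin=-8, qmax=7):
    """Round each weight to the nearest int4 level, then map back to float."""
    restored = []
    for w in row:
        q = max(qmin, min(qmax, round(w / scale)))
        restored.append(q * scale)
    return restored

def per_tensor(weights):
    """One scale shared by every channel (row) in the matrix."""
    scale = scale_for([w for row in weights for w in row])
    return [quantize_row(row, scale) for row in weights]

def per_channel(weights):
    """A separate scale for each channel (row)."""
    return [quantize_row(row, scale_for(row)) for row in weights]

def mean_abs_error(original, restored):
    """Average absolute difference between original and restored weights."""
    flat_o = [w for row in original for w in row]
    flat_r = [w for row in restored for w in row]
    return sum(abs(a - b) for a, b in zip(flat_o, flat_r)) / len(flat_o)

# Toy weight matrix: one small-magnitude channel, one large-magnitude channel.
weights = [
    [0.01, -0.02, 0.015, -0.005],
    [1.5, -2.0, 0.7, -1.1],
]

err_tensor = mean_abs_error(weights, per_tensor(weights))
err_channel = mean_abs_error(weights, per_channel(weights))
print(f"per-tensor error:  {err_tensor:.5f}")
print(f"per-channel error: {err_channel:.5f}")
```

With these toy weights, the shared per-tensor scale is set by the large channel, so every weight in the small channel rounds to zero. Giving each channel its own scale preserves them, at the cost of storing one extra scale per channel.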

What the HN community is saying

Commenters pointed to confusion caused by rapid back-to-back releases: Gemma 4 base models, MTP variants, the 12B model, and now QAT versions, all within three weeks. Each variant creates downstream work for projects like llama.cpp that must add support for it. One developer noted that GGUF files were initially missing from the repos even though the blog post said they were available. Separate threads praised practical results: the E2B model runs on-device with audio and vision support in 3.2GB, and Unsloth's quantized versions reportedly achieve accuracy similar to Google's QAT baselines. Commenters disagreed over whether Unsloth's results represent an improvement over Google's QAT or simply different packaging of the same trained checkpoint.