Gemma 4 12B: A unified, encoder-free multimodal model

By Olivier Lacombe · Google (2026-06-03) · On Hacker News (2026-06-04)

916 points · 347 comments on HN

Points and comments are a snapshot, not live.

Google releases Gemma 4 12B, a 12-billion-parameter multimodal model that uses direct projection instead of separate vision and audio encoders.

Gemma 4 12B is designed to run on consumer laptops with 16GB of RAM while accepting vision and audio inputs alongside text. The model replaces traditional multimodal encoders with lightweight linear projections: vision input goes through a 35-parameter embedder consisting of a single matrix multiplication and a coordinate lookup, while audio is projected directly from the raw signal without a dedicated encoder. Reported performance approaches that of Google's larger 26B Mixture-of-Experts model on standard benchmarks. The model is released under the Apache 2.0 license and is available on Hugging Face and Kaggle, as well as on multiple inference platforms including LM Studio, Ollama, and llama.cpp. It ships with Multi-Token Prediction drafters to reduce latency.
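To make the "single matrix multiplication and coordinate lookup" description concrete, here is a minimal sketch in plain Python. It is not Gemma's actual code: the function names, the patch size, the toy dimensions, and the reading of "coordinate lookup" as learned row and column position tables are all assumptions made for illustration.

```python
import random

def patchify(image, patch):
    """Split an H x W grid into flattened patches keyed by (row, col).
    Assumes H and W are multiples of `patch` (illustrative simplification)."""
    h, w = len(image), len(image[0])
    patches = {}
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            flat = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            patches[(r // patch, c // patch)] = flat
    return patches

def project(vec, weight):
    """The one matrix multiplication: (patch_dim,) x (patch_dim, d_model) -> (d_model,)."""
    d_model = len(weight[0])
    return [sum(vec[k] * weight[k][d] for k in range(len(vec))) for d in range(d_model)]

def embed_image(image, patch, weight, row_table, col_table):
    """Turn an image into token embeddings with no separate vision encoder."""
    tokens = []
    for (r, c), flat in sorted(patchify(image, patch).items()):
        projected = project(flat, weight)
        # Assumed form of the coordinate lookup: add position rows indexed by (r, c).
        tokens.append([p + row_table[r][d] + col_table[c][d]
                       for d, p in enumerate(projected)])
    return tokens

# Toy usage: a 4x4 single-channel image, 2x2 patches, embedding width 3.
rng = random.Random(0)
PATCH, D_MODEL = 2, 3
image = [[rng.random() for _ in range(4)] for _ in range(4)]
weight = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH)]
row_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(2)]
col_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(2)]
tokens = embed_image(image, PATCH, weight, row_table, col_table)
print(len(tokens), len(tokens[0]))
```

In a design like this the resulting tokens would be fed to the language model alongside text embeddings, so all cross-modal processing happens inside the main transformer rather than in a dedicated encoder stack.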

What commenters are saying

Top commenters questioned the "encoder-free" framing, noting that linear projection and embedding layers still technically encode data. The dominant concern was that the performance claims are reported at bf16 precision, whereas the "16GB laptop" claim likely requires int8 quantization, which causes some quality loss. Commenters nonetheless found 12GB at 8-bit reasonable for consumer use. The architecture itself drew comparisons to the 2024 early-fusion models Chameleon (from FAIR) and EVE. One practical note: unlike previous Gemma versions, the model no longer needs a separate .mmproj file in llama.cpp, since the projection is built in.
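The arithmetic behind the precision concern is easy to check. The sketch below counts weights only, using decimal gigabytes, and assumes a round 12 billion parameters; it ignores the KV cache, activations, and operating system overhead, so real memory use would be higher. The function names are illustrative.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def fits_in_ram(num_params, bits_per_param, ram_gb):
    """True if the weights alone fit; runtime overhead is not modelled."""
    return weight_footprint_gb(num_params, bits_per_param) < ram_gb

PARAMS = 12e9   # assumed round figure for a "12B" model
RAM_GB = 16     # the laptop size quoted in the announcement

for name, bits in [("bf16", 16), ("int8", 8)]:
    size = weight_footprint_gb(PARAMS, bits)
    verdict = "fits" if fits_in_ram(PARAMS, bits, RAM_GB) else "does not fit"
    print(f"{name}: {size:.0f} GB of weights, {verdict} in {RAM_GB} GB")
```

Under these assumptions the bf16 weights come to 24 GB and the int8 weights to 12 GB, which matches the 12GB figure commenters cited and explains why the benchmark precision and the laptop claim cannot both describe the same configuration.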