DiffusionGemma: 4x Faster Text Generation

311 points · 86 comments on HN

Points and comments are a snapshot, not live.

Google has released DiffusionGemma, a 26B-parameter model that generates text 4x faster by producing blocks of tokens in parallel through diffusion instead of decoding one token at a time.

DiffusionGemma, released under the Apache 2.0 license, generates 256 tokens in parallel per forward pass rather than one token at a time. It reaches over 1,000 tokens per second on an H100 GPU and over 700 on an RTX 5090. The 26B-parameter mixture-of-experts model activates only 3.8B parameters during inference and fits within 18 GB of VRAM when quantized. It uses bidirectional attention to refine entire blocks of text iteratively, which enables tasks such as in-line editing and code infilling. The trade-offs are lower output quality than standard Gemma 4 and diminished benefits in high-throughput cloud serving, where autoregressive batching already saturates compute. The model is optimized for local, single-user inference on dedicated GPUs.
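The arithmetic behind those figures can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not DiffusionGemma's actual scheduler: the 256-token block size and 26B parameter count come from the summary, while `refine_steps` and `bits_per_weight` are hypothetical knobs, since the summary gives neither the number of refinement passes per block nor the quantization width.

```python
import math

BLOCK_TOKENS = 256  # tokens produced in parallel per block (from the summary)


def autoregressive_passes(n_tokens: int) -> int:
    """An autoregressive decoder needs one forward pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens: int, refine_steps: int) -> int:
    """Forward passes for block diffusion.

    refine_steps is an ASSUMED number of refinement passes per block;
    the summary does not state the real value.
    """
    blocks = math.ceil(n_tokens / BLOCK_TOKENS)
    return blocks * refine_steps


def weight_gigabytes(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes.

    bits_per_weight is an ASSUMED quantization width. Activations and
    caches are not counted.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n = 1024
    print("autoregressive passes:", autoregressive_passes(n))
    print("diffusion passes (64 assumed steps):", diffusion_passes(n, 64))
    print("26B weights at an assumed 4 bits (GB):", weight_gigabytes(26e9, 4))
```

Fewer forward passes do not translate directly into wall-clock speedup, because a pass over 256 positions costs more than a pass over one. Note also that all 26B parameters must be resident in memory even though only 3.8B are active per token, which is why the memory estimate uses the total count.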

What commenters are saying

Commenters emphasize DiffusionGemma's niche: strong for local inference but weak for cloud serving. One user notes that diffusion's speedup vanishes under batching, where autoregressive models already fully utilize the hardware, which could make diffusion more expensive at scale. Several threads explore whether the quality loss is offset by the ability to run larger diffusion models locally. A developer reported a positive real-world experience using fast diffusion models in pair-programming workflows, valuing speed and interactivity over maximum quality. Skepticism centers on whether the quality gap with autoregressive models can close enough to justify research investment if cloud deployment remains unprofitable.
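The batching argument can be illustrated with a toy saturation model. Everything here is an assumption made for illustration, not a measurement: the GPU is modeled as handling up to `saturation` token positions per forward pass at constant latency and scaling linearly beyond that, and `steps` is a hypothetical number of refinement passes per block.

```python
def pass_time(positions: float, saturation: float) -> float:
    """Relative latency of one forward pass in the toy model.

    Constant (1.0) while the hardware is underused, then linear in the
    number of token positions once it saturates.
    """
    return max(1.0, positions / saturation)


def ar_throughput(batch: int, saturation: float) -> float:
    """Tokens per unit time for autoregressive decoding: each pass
    processes one position per request and yields one token per request."""
    return batch / pass_time(batch, saturation)


def diffusion_throughput(batch: int, saturation: float,
                         block: int = 256, steps: int = 64) -> float:
    """Tokens per unit time for block diffusion: each pass processes a
    whole block per request, but a block needs `steps` passes (ASSUMED)
    before its tokens are final."""
    positions = batch * block
    finished_tokens_per_pass = positions / steps
    return finished_tokens_per_pass / pass_time(positions, saturation)


if __name__ == "__main__":
    SATURATION = 4096  # hypothetical positions per pass before compute saturates
    for batch in (1, 16, 4096):
        ar = ar_throughput(batch, SATURATION)
        diff = diffusion_throughput(batch, SATURATION)
        print(f"batch={batch:5d}  autoregressive={ar:8.1f}  diffusion={diff:8.1f}")
```

In this model, diffusion wins at batch size 1 because the autoregressive decoder leaves most of the hardware idle. Once batching fills the GPU, every diffusion token has still been processed `steps` times, so the advantage reverses. Real systems differ in many ways, so the numbers only show the direction of the effect the commenters describe.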