Google releases experimental open-source DiffusionGemma model that generates text up to 4x faster using diffusion

Google released DiffusionGemma, an experimental open-source model that generates text up to four times faster than standard LLMs. It activates just 3.8 billion of its 26 billion parameters during inference, easing local hardware bottlenecks.

Published on: Jun 13, 2026
Google released DiffusionGemma, an experimental open-source model that generates text up to four times faster than standard large language models. By drafting entire passages simultaneously instead of processing tokens sequentially, the 26-billion-parameter model reduces hardware bottlenecks for developers running local AI workloads.

Parallel text generation

Traditional large language models generate text one token at a time, from left to right. DiffusionGemma instead applies the diffusion techniques used in image generation to text: it begins with a canvas of random placeholder tokens and refines them over multiple passes.

"It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously," Google research scientists Brendan O'Donoghue and Sebastian Flennerhag said in a blog post.
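As a rough illustration of the refinement loop described above, the following Python sketch starts from a canvas of random characters and updates every position in each pass. It is a toy under stated assumptions, not DiffusionGemma's actual algorithm: `toy_denoiser`, `refine`, the character vocabulary, and the linear acceptance schedule are all hypothetical, and the stand-in denoiser simply knows the target text where a real model would have to predict it.

```python
import random

# Hypothetical character-level vocabulary; a real model works on subword tokens.
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")


def toy_denoiser(canvas, target):
    """Stand-in for a trained model: proposes a token for every position at once.

    Assumption: it cheats by returning the target, whereas a real model would
    predict each token from the current (noisy) canvas.
    """
    return list(target)


def refine(target, steps=4, seed=0):
    """Refine a random canvas toward `target` in `steps` parallel passes."""
    rng = random.Random(seed)
    canvas = [rng.choice(VOCAB) for _ in target]  # random placeholder tokens
    history = ["".join(canvas)]
    for step in range(1, steps + 1):
        proposal = toy_denoiser(canvas, target)   # one pass covers all positions
        accept_prob = step / steps                # reaches 1.0 on the final pass
        canvas = [
            new if rng.random() < accept_prob else old
            for old, new in zip(canvas, proposal)
        ]
        history.append("".join(canvas))
    return history


if __name__ == "__main__":
    for i, snapshot in enumerate(refine("parallel text generation")):
        print(i, repr(snapshot))
```

Because every position is updated in the same pass, the string settles after four passes rather than one step per character, which is the intuition behind the claimed speedup.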

Of its 26 billion parameters, the model activates only 3.8 billion during inference. When quantized, it fits within 18GB of VRAM on high-end consumer GPUs such as the Nvidia RTX 5090. This allows developers to deploy capable generative AI and LLM tools on local machines without relying on cloud infrastructure.
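The memory figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes all 26 billion weights stay resident in VRAM and counts weights only, ignoring activations and runtime overhead. The article does not state which quantization format Google used, so the bit widths here are illustrative, and the function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def max_bits_per_weight(num_params: float, budget_gb: float) -> float:
    """Largest average bit width whose weights alone fit in the budget."""
    return budget_gb * 1e9 * 8 / num_params


TOTAL_PARAMS = 26e9   # total parameter count reported in the article
VRAM_BUDGET_GB = 18   # quantized footprint reported in the article

for bits in (16, 8, 4):  # illustrative bit widths, not confirmed by the source
    print(f"{bits:>2}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")

limit = max_bits_per_weight(TOTAL_PARAMS, VRAM_BUDGET_GB)
print(f"weights fit in {VRAM_BUDGET_GB} GB below about {limit:.2f} bits per weight")
```

Under these assumptions, 16-bit weights need about 52 GB, 8-bit about 26 GB, and 4-bit about 13 GB, so an 18GB footprint implies an average of roughly 5.5 bits per weight or fewer.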

Self-correction and coding workflows

The model uses bidirectional attention to improve accuracy. Because each forward pass generates 256 tokens in parallel, every token can attend to all the others, and the system uses confidence scoring to re-evaluate and fix mistakes in real time.
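One way to picture confidence scoring is a loop that re-scores every position on each pass, keeps tokens above a threshold, and masks the rest for another attempt. The sketch below is a minimal illustration of that idea, not Google's implementation: `MASK`, `make_toy_predictor`, `refine_with_confidence`, the 0.6 threshold, and the confidence formula are all hypothetical. The toy predictor is confident about even positions immediately and about odd positions only once a neighbor on either side has been filled in, which loosely mimics bidirectional context.

```python
MASK = "<mask>"  # hypothetical placeholder token


def make_toy_predictor(target):
    """Build a stand-in model that returns (token, confidence) per position."""
    def predict(canvas):
        scored = []
        for i, token in enumerate(target):
            # Context can come from either side, as with bidirectional attention.
            left = i > 0 and canvas[i - 1] != MASK
            right = i < len(target) - 1 and canvas[i + 1] != MASK
            confidence = 0.9 if i % 2 == 0 else 0.4 + 0.3 * (left + right)
            scored.append((token, confidence))
        return scored
    return predict


def refine_with_confidence(predict, length, threshold=0.6, max_passes=8):
    """Re-score all positions each pass; mask anything below the threshold."""
    canvas = [MASK] * length
    for passes in range(1, max_passes + 1):
        scored = predict(canvas)  # one parallel pass over the whole canvas
        canvas = [tok if conf >= threshold else MASK for tok, conf in scored]
        if MASK not in canvas:
            return canvas, passes
    return canvas, max_passes


if __name__ == "__main__":
    target = ["def", "add", "(", "a", ",", "b", ")"]
    print(refine_with_confidence(make_toy_predictor(target), len(target)))
```

Since every position is re-scored on every pass, a token accepted earlier can be masked again if its confidence later drops, which is the re-evaluation behavior the paragraph above describes.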

This parallel structure suits non-linear tasks such as mathematical graphs, code infilling, and inline editing. Technology analyst Carmi Levy said the model is particularly well suited to generative code workflows, where its efficiency allows rapid processing and iteration.

Levy also noted that the model incorporates a thinking mode adept at problem solving. Google fine-tuned the model to play Sudoku, a task that typically challenges autoregressive models because the correct value for each token depends on tokens that come later in the sequence.
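The Sudoku point can be made concrete by counting how many of a cell's constraints sit later in the sequence. The sketch below assumes the grid is serialized row by row, one cell per token. The article does not say how Google encoded the puzzle, so that ordering and the helper names `peers` and `later_peers` are assumptions for illustration.

```python
def peers(row, col):
    """All cells sharing a row, column, or 3x3 box with (row, col)."""
    box_r, box_c = 3 * (row // 3), 3 * (col // 3)
    cells = {(row, c) for c in range(9)}
    cells |= {(r, col) for r in range(9)}
    cells |= {
        (r, c)
        for r in range(box_r, box_r + 3)
        for c in range(box_c, box_c + 3)
    }
    cells.discard((row, col))
    return cells


def later_peers(row, col):
    """Peers that come after (row, col) when the grid is written row by row."""
    index = 9 * row + col
    return {(r, c) for r, c in peers(row, col) if 9 * r + c > index}


if __name__ == "__main__":
    for cell in [(0, 0), (4, 4), (8, 8)]:
        print(cell, len(later_peers(*cell)), "of", len(peers(*cell)), "peers come later")
```

Under this ordering, the first cell must be chosen before any of its 20 constraining cells have been written, whereas a model that fills the whole grid in parallel can weigh all of them together.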

Limitations and deployment

Google designed the model for small batch sizes and low-latency generation on a single capable accelerator. In high-QPS cloud serving environments, the parallel processing offers diminishing returns and can increase serving costs.

The overall output quality is lower than standard Gemma 4, which is built for applications demanding maximum quality. However, Levy said subsequent refinement cycles could overcome this precision limitation in specific workloads.

Released under the Apache 2.0 license, DiffusionGemma is available on Hugging Face, GitHub, vLLM, Google Cloud Model Garden, and Nvidia NIM. Support for the open-source library llama.cpp is coming soon.

Why this matters for IT and development professionals

Developers managing local AI deployments can reduce inference costs and hardware overhead by replacing sequential token generation with parallel diffusion. While it is not a replacement for high-quality cloud models, it provides a practical, low-latency option for internal coding assistants and specialized, non-linear text tasks running on workstations with a single capable GPU.

