June 11, 2026
On June 10, 2026, Google released DiffusionGemma 26B-A4B, an experimental open-weights model from Google DeepMind that generates text through diffusion rather than predicting one token at a time. DiffusionGemma is built on the Gemma 4 26B-A4B Mixture-of-Experts architecture and denoises blocks of 256 tokens in parallel. Google reports more than 1,000 tokens per second on a single NVIDIA H100 and generation up to 4x faster than comparable Gemma models. The model is released under the Apache 2.0 license, making it free to use commercially and to run locally.
What is DiffusionGemma 26B-A4B and how does text diffusion work?
DiffusionGemma 26B-A4B is Google DeepMind’s first open text diffusion model. Where standard language models such as Gemma 4 generate text autoregressively, producing one token after another from left to right, DiffusionGemma uses discrete diffusion. It starts from a masked or noisy canvas of 256 tokens and refines all of them in parallel across several denoising steps, so each forward pass works on a full 256-token block rather than producing a single token.
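To make the idea concrete, here is a toy sketch of block-parallel denoising in Python. It is not DiffusionGemma's sampler, which this article does not describe. The confidence-based unmasking schedule, the 8-step budget, and the `toy_denoiser` stand-in (which simply knows the answer) are all assumptions chosen for illustration.

```python
import random

MASK = None  # marks a position that has not been decoded yet


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (token, confidence) proposal for every position at once.
    A real model would predict these from the prompt and the current canvas;
    this toy version already knows the target and invents the confidences.
    """
    return [(target[i], rng.random()) for i in range(len(canvas))]


def denoise_block(target, steps=8, seed=0):
    """Fill a fully masked block in at most `steps` parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        if not masked:
            break
        proposals = toy_denoiser(canvas, target, rng)  # scores all positions
        passes += 1
        # Commit only the most confident masked positions this step, sized so
        # that the block is guaranteed to be complete by the final step.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:k]:
            canvas[i] = proposals[i][0]
    return canvas, passes


canvas, passes = denoise_block(list(range(256)), steps=8)
print(f"filled {len(canvas)} positions in {passes} forward passes")
```

In this toy setup the 256-position block is complete after 8 passes, where a left-to-right decoder would need 256 passes for the same block. The real ratio depends on how many denoising steps the model actually takes.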
The model was created by Google DeepMind research scientists Brendan O’Donoghue and Sebastian Flennerhag, who applied the lab’s earlier Gemini Diffusion research to the Gemma 4 backbone. The result is a model that keeps the open Gemma 4 architecture but swaps the generation method, trading a small amount of quality for a large gain in generation speed on tasks where low latency matters.
DiffusionGemma benchmarks and technical specs
DiffusionGemma 26B-A4B has 25.2 billion total parameters but activates only 3.8 billion parameters per token through its Mixture-of-Experts design. It supports a 256K token context window, handles more than 140 languages, and accepts interleaved text, image, and video inputs while producing text output. The knowledge cutoff is January 2025.
On generation speed, Google reports more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on a consumer NVIDIA GeForce RTX 5090, with each forward pass emitting 256 tokens. Once quantized, the model fits in roughly 18GB of VRAM, and with only 3.8 billion active parameters it is designed to run on a single consumer GPU rather than a server cluster.
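A quick back-of-the-envelope calculation puts those figures in perspective. The sketch below uses only the numbers quoted above and makes two assumptions: "roughly 18GB" is read as 18 decimal gigabytes, and that whole footprint is attributed to the 25.2 billion weights. Because the throughput figures are lower bounds ("more than"), the per-block times are upper bounds on sustained throughput, not measured latencies.

```python
TOTAL_PARAMS = 25.2e9    # total parameters, from the spec above
QUANTIZED_BYTES = 18e9   # "roughly 18GB", read as decimal gigabytes (assumption)
BLOCK_TOKENS = 256       # tokens per denoised block


def seconds_per_block(tokens_per_second, block_tokens=BLOCK_TOKENS):
    """Time to produce one block at a given sustained throughput."""
    return block_tokens / tokens_per_second


def bits_per_parameter(total_bytes, total_params):
    """Average storage per weight if the whole footprint were weights."""
    return total_bytes * 8 / total_params


h100_seconds = seconds_per_block(1000)
rtx5090_seconds = seconds_per_block(700)
avg_bits = bits_per_parameter(QUANTIZED_BYTES, TOTAL_PARAMS)

print(f"H100:     {h100_seconds:.3f} s per 256-token block")
print(f"RTX 5090: {rtx5090_seconds:.3f} s per 256-token block")
print(f"Average:  {avg_bits:.2f} bits per parameter")
```

Under these assumptions a 256-token block takes about a quarter of a second on the H100 and a little over a third of a second on the RTX 5090, and the quantized weights average just under 6 bits per parameter.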
How does DiffusionGemma compare to standard Gemma 4 26B-A4B?
DiffusionGemma is faster but scores lower than the standard autoregressive Gemma 4 26B-A4B across the benchmarks Google published. On MMLU Pro it reaches 77.6 percent versus 82.6 percent for Gemma 4, on LiveCodeBench v6 it scores 69.1 percent versus 77.1 percent, on GPQA Diamond 73.2 percent versus 82.3 percent, and on Codeforces it records an Elo of 1429 versus 1718. The gap runs in the same direction on every benchmark: the diffusion method costs between 5.0 and 9.1 percentage points on the three accuracy benchmarks, and 289 Elo points on Codeforces, in exchange for its speed advantage.
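The size of the gap on each benchmark can be worked out directly from the published scores. The short Python sketch below does only that subtraction; the dictionary layout and the `score_gaps` helper are illustrative names, not part of any Google tooling.

```python
# (DiffusionGemma score, Gemma 4 26B-A4B score), as quoted above.
BENCHMARKS = {
    "MMLU Pro (%)": (77.6, 82.6),
    "LiveCodeBench v6 (%)": (69.1, 77.1),
    "GPQA Diamond (%)": (73.2, 82.3),
    "Codeforces (Elo)": (1429, 1718),
}


def score_gaps(scores):
    """Return how far DiffusionGemma trails Gemma 4 on each benchmark."""
    return {
        name: round(gemma - diffusion, 1)
        for name, (diffusion, gemma) in scores.items()
    }


gaps = score_gaps(BENCHMARKS)
for name, gap in gaps.items():
    print(f"{name}: Gemma 4 leads by {gap}")
```

The percentage benchmarks and the Elo rating are on different scales, so the Codeforces gap should be read on its own rather than compared numerically with the other three.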
Google is explicit about the trade-off and recommends standard Gemma 4 for applications that need maximum quality, positioning DiffusionGemma as the option for latency-sensitive work rather than a replacement for the full Gemma 4 lineup. Because both models share the same 26B-A4B architecture and 256K context window, teams can move between them without re-engineering their pipelines.
What is DiffusionGemma 26B-A4B best used for?
DiffusionGemma targets speed-critical, interactive workflows that run locally or at low concurrency. Google names in-line text editing, code infilling, markdown formatting, and amino acid sequence generation as fit-for-purpose use cases, all tasks where the model fills in or rewrites a bounded block of text and where parallel 256-token generation gives the clearest benefit.
Because it runs on a single consumer GPU and is licensed under Apache 2.0, DiffusionGemma is aimed at developers who want fast local inference without per-token API costs. For high-stakes reasoning or long-form generation where accuracy outweighs latency, Google continues to point users to standard Gemma 4.
DiffusionGemma availability, licensing and how to run it
DiffusionGemma 26B-A4B is available now as open weights under the Apache 2.0 license, which permits commercial use. The instruction-tuned weights are published on Hugging Face as google/diffusiongemma-26B-A4B-it, and the model can be run locally on a single GPU after quantization to around 18GB of VRAM.
Full details are available on the official Google Blog on DiffusionGemma.