Google releases Gemma 4 12B

June 4, 2026

Gemma 4 12B is an open-weight, encoder-free multimodal model with a 256,000-token context window that processes text, images, audio, and video on a 16GB laptop.

Written by:

Jorick van Weelie

Marketing Lead at DataNorth | AI Enthusiast & Tech Storyteller



On June 3, 2026, Google released Gemma 4 12B, a new open-weight multimodal model that processes text, images, audio, and video and runs entirely on a typical 16GB laptop. Gemma 4 12B uses an encoder-free architecture, supports a 256,000-token context window, and ships with open weights under the Apache 2.0 license on Hugging Face. It is the first mid-sized Gemma model with native audio input.

What can Gemma 4 12B do?

Gemma 4 12B is a 12-billion-parameter open-weight model that accepts text, images, audio, and video as input. It is the first mid-sized model in the Gemma family to support native audio, which means it can process raw speech and sound without a separate transcription step. The 256,000-token context window lets the model work across long documents, large codebases, and multi-step agentic workflows in a single pass.

The model is designed to run locally. Gemma 4 12B fits on a system with roughly 16GB of VRAM or unified memory, which covers many current Windows laptops and Apple MacBook configurations. A multimodal model can therefore run on a single machine without sending data to a cloud service, which matters for privacy-sensitive workloads and offline use.
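To put the 16GB figure in context, the weights-only footprint of a dense 12-billion-parameter model can be estimated with simple arithmetic. The article does not state which numeric precision the 16GB claim assumes, so the precisions below are illustrative assumptions, and the estimate ignores the KV cache and activations.

```python
# Back-of-the-envelope weight memory for a dense 12-billion-parameter model.
# The precisions listed here are illustrative assumptions, not figures from Google.
PARAMS = 12_000_000_000

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: int, bytes_per_param: float) -> float:
    """Weights-only footprint in gigabytes (10^9 bytes).

    Excludes the KV cache, activations, and runtime overhead.
    """
    return params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gb(PARAMS, nbytes):.1f} GB")
# bf16: 24.0 GB
# int8: 12.0 GB
# int4: 6.0 GB
```

Under these assumptions, 16-bit weights alone would exceed 16GB, so a 16GB device would need weights stored at roughly 8 bits per parameter or fewer, with the remaining memory left for the context.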

How does the encoder-free architecture work?

Most multimodal models attach separate vision and audio encoders to the language model. Gemma 4 12B removes these encoders and projects raw inputs directly into the language model’s embedding space through lightweight linear layers. For vision, the model uses an embedder of around 35 million parameters that splits images into 48 by 48 pixel patches and projects each patch with a single matrix multiplication plus factorized X and Y positional lookups.
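The vision path described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Google's implementation: the embedding width `D_MODEL`, the grid limit `MAX_GRID`, the random weights, and the assumption of three color channels are all hypothetical. Only the 48 by 48 patch size, the single matrix multiplication, and the separate X and Y lookups come from the article.

```python
import numpy as np

PATCH = 48      # patch side in pixels (from the article)
D_MODEL = 64    # hypothetical embedding width, kept small for the demo
MAX_GRID = 32   # hypothetical maximum number of patches per axis

rng = np.random.default_rng(0)
# One projection matrix: flattened RGB patch (48 * 48 * 3 values) -> embedding.
W = rng.normal(0.0, 0.02, size=(PATCH * PATCH * 3, D_MODEL))
# Factorized positions: one lookup table per axis instead of one per (x, y) pair.
pos_x = rng.normal(0.0, 0.02, size=(MAX_GRID, D_MODEL))
pos_y = rng.normal(0.0, 0.02, size=(MAX_GRID, D_MODEL))


def embed_image(image: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) image to (num_patches, D_MODEL) token embeddings.

    H and W are assumed to be multiples of PATCH.
    """
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    # Cut into non-overlapping patches in row-major order and flatten each one.
    patches = (
        image.reshape(gh, PATCH, gw, PATCH, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, PATCH * PATCH * c)
    )
    tokens = patches @ W  # the single matrix multiplication
    ys, xs = np.divmod(np.arange(gh * gw), gw)  # row and column of each patch
    return tokens + pos_x[xs] + pos_y[ys]


tokens = embed_image(np.zeros((96, 144, 3)))
print(tokens.shape)  # (6, 64): a 2 x 3 grid of patches
```

With factorized lookups, the number of positional parameters grows with the sum of the grid dimensions rather than their product, which keeps the embedder small.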

For audio, Gemma 4 12B slices a 16 kHz signal into 40 millisecond frames of 640 values each and projects them linearly into token space. Removing the dedicated encoders reduces the memory footprint and is the main reason the model fits on consumer hardware while still handling four input types.
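The audio framing works out arithmetically: 16,000 samples per second times 0.040 seconds gives 640 samples per frame, or 25 frames per second. The sketch below shows that framing and a linear projection in NumPy. The embedding width `D_MODEL`, the random weights, and the choice to drop a trailing partial frame are assumptions made for illustration; the article does not describe how the model handles them.

```python
import numpy as np

SAMPLE_RATE = 16_000                        # Hz (from the article)
FRAME_MS = 40                               # frame length in ms (from the article)
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 640 samples per frame
D_MODEL = 64                                # hypothetical embedding width

rng = np.random.default_rng(0)
# Hypothetical projection weights: one 640-sample frame -> one embedding.
W_audio = rng.normal(0.0, 0.02, size=(FRAME_LEN, D_MODEL))


def embed_audio(signal: np.ndarray) -> np.ndarray:
    """Map a 1-D 16 kHz waveform to (num_frames, D_MODEL) embeddings."""
    n_frames = len(signal) // FRAME_LEN
    # Assumption: samples that do not fill a whole frame are dropped.
    frames = signal[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    return frames @ W_audio


one_second = np.zeros(SAMPLE_RATE, dtype=np.float32)
print(FRAME_LEN)                      # 640
print(embed_audio(one_second).shape)  # (25, 64)
```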

Gemma 4 12B benchmarks and technical specs

Gemma 4 12B scores 94.9% on DocVQA and 88.4% on InfoVQA, two benchmarks for reading and answering questions about documents and infographics. On MMMU Pro, a multimodal reasoning benchmark, it reaches 69.1%. On text-based reasoning, the model scores 77.2% on MMLU Pro and 78.8% on GPQA Diamond. On mathematics, it reaches 77.5% on AIME 2026 and 79.7% on MATH-Vision.

According to Google, Gemma 4 12B performs close to the larger Gemma 4 26B mixture-of-experts model on standard benchmarks while using less than half the total memory. The model has a 256,000-token context window and a dense 12-billion-parameter design, and the open weights are published in both a base and an instruction-tuned variant.

Gemma 4 12B availability, license, and pricing

Gemma 4 12B is available now with open weights under the Apache 2.0 license, which permits commercial use. The weights are published on Hugging Face as google/gemma-4-12B and google/gemma-4-12B-it, and the model is accessible through Google AI for Developers and local runtimes such as LM Studio. Because the model runs locally, there is no per-token API cost for self-hosted deployment.

The instruction-tuned variant (gemma-4-12B-it) is intended for chat and assistant use, while the base variant is intended for fine-tuning. Both can run on a single 16GB device, which lowers the hardware barrier compared with models that require dedicated server GPUs.

How does Gemma 4 12B compare to earlier Gemma models?

Gemma 4 12B succeeds the Gemma 3 generation and adds native audio input, video understanding, and the encoder-free architecture. The headline change is efficiency: the model targets frontier-level multimodal performance at a size that fits on a laptop, where earlier open multimodal models of similar capability typically required more memory or separate encoders.

Within Google’s own line-up, Gemma 4 12B sits below the larger Gemma 4 26B mixture-of-experts model and the proprietary Gemini 3.5 family. The distinction is that Gemma is open-weight and built to run on local hardware, while Gemini is delivered as a managed API. For developers who need an on-device multimodal model with a permissive license, Gemma 4 12B is the option in this line-up that meets both requirements.

For full details on Gemma 4 12B, including the architecture overview and benchmark methodology, see the official announcement on the Google blog.