Google shipped Gemma 4 12B on June 3, 2026. It’s an open-weights model with 11.95 billion parameters that reads text, images, audio, and video, and it fits on a laptop with 16GB of memory. The headline detail: it’s the first mid-sized model with native audio input, and it does this with no separate vision or audio encoder.
That last part is what makes it different. Most multimodal models bolt a vision encoder and an audio encoder onto a language model. Gemma 4 12B drops both and feeds raw image patches and audio waveforms straight into the model. You get a single 12B file that handles four input types, runs offline, and ships under an Apache 2.0 license you can use commercially.
Here’s what the model is, where it sits in the Gemma 4 family, and what you can build with it. If you want to run it today, jump to the companion guide on how to use Gemma 4 12B for free.
Gemma 4 12B at a glance
| Spec | Value |
|---|---|
| Released | June 3, 2026 |
| Parameters | 11.95B (dense) |
| Inputs | Text, image, audio, video |
| Output | Text |
| Context window | 256K tokens |
| Architecture | Encoder-free unified multimodal |
| License | Apache 2.0 |
| Runs on | 16GB VRAM or unified memory (about 8GB at 4-bit) |
| Variants | google/gemma-4-12B (base), google/gemma-4-12B-it (instruction-tuned) |
The short answer
Gemma 4 12B is a dense, 12-billion-parameter open model from Google DeepMind that takes text, images, audio, and video as input and returns text. It’s tuned to run locally on consumer hardware, with a 256K-token context window, native tool calling, and an optional step-by-step reasoning mode.

It sits in the middle of the Gemma 4 lineup. Google describes it as the bridge between the edge-friendly E4B model and the larger 26B Mixture-of-Experts model, with quality that approaches the 26B on several benchmarks at less than half the memory footprint.
Where the 12B fits in the Gemma 4 family
Gemma 4 didn’t launch all at once. The E2B, E4B, 26B, and 31B models arrived on March 31, 2026. The 12B is the newest member, added on June 3. Here’s the full line:
| Model | Size | Context | Notes |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective (5.1B raw) | 128K | On-device, audio input |
| Gemma 4 E4B | 4.5B effective (8B raw) | 128K | Compact, audio input |
| Gemma 4 12B | 11.95B dense | 256K | Encoder-free, audio input |
| Gemma 4 26B A4B | 4B active / 26B total (MoE) | 256K | Mixture-of-experts |
| Gemma 4 31B | 31B dense | 256K | Frontier performance |
The 12B is the only model in the family built on the encoder-free design. The others keep a traditional vision encoder (and a conformer audio encoder on the smaller two). That makes the 12B the cleanest demonstration of where Google is taking on-device multimodal AI.
For context on how these stack up against other open models, see our comparison of MiniMax M3, DeepSeek V4, and Qwen 3.7 and the wider open-weight price war.
What “encoder-free” actually means
Standard multimodal models work in two stages. A vision encoder turns an image into embeddings, an audio encoder turns sound into embeddings, and then a projector maps those into the language model’s space. That’s three components to load, tune, and keep in memory.
Gemma 4 12B removes the encoders. According to Google’s writeup:
- Vision: a lightweight embedding module (a single matrix multiplication plus positional embeddings and normalization) projects raw image patches directly into the model’s embedding space.
- Audio: the audio encoder is gone. Raw audio is projected into the same dimensional space as text tokens, so sound and words share one pathway.
The vision and audio inputs flow straight into the language model backbone. One model, one set of weights, every modality treated as tokens.
Two more architecture choices keep it efficient on small hardware:
- Per-layer embeddings (PLE): each decoder layer gets a small dedicated embedding that mixes a token-identity lookup with a context-aware projection. This cuts parameter cost while letting layers specialize.
- Shared KV cache: the last several layers reuse key-value tensors from earlier layers instead of computing their own. That trims memory during long-context and on-device runs with little quality cost.
Google also ships a Multi-Token Prediction (MTP) drafter for speculative decoding, which can speed end-to-end inference by up to roughly 3x with no change to output quality.
Native audio and full multimodality
Plenty of open models read images. Gemma 4 12B is the first mid-sized one to take audio natively, in the same model that handles text and vision. That opens a different class of work:
- Automatic speech recognition and transcription
- Speaker diarization (who spoke when)
- Audio question answering over non-speech sounds
- Video understanding, with audio, not just frames
- Image tasks: captioning, object and UI detection, visual reasoning
Input order matters when you mix modalities. The chat template expects image content before the text prompt and audio after it. The model returns text in every case.
How Gemma 4 12B performs
These are the published scores for the instruction-tuned gemma-4-12B-it, from the Hugging Face model card:
| Benchmark | Gemma 4 12B-it |
|---|---|
| MMLU Pro (reasoning) | 77.2% |
| AIME 2026 (math, no tools) | 77.5% |
| GPQA Diamond (science) | 78.8% |
| LiveCodeBench v6 (coding) | 72.0% |
| Codeforces (ELO) | 1659 |
| MMMU Pro (vision) | 69.1% |
| MATH-Vision | 79.7% |
| MRCR v2, 128K, 8-needle (long context) | 43.4% |
To put that in the family context, here’s how the 12B lands between its neighbors on a few headline tests:
| Benchmark | E4B | 12B | 26B A4B | 31B |
|---|---|---|---|---|
| MMLU Pro | 69.4% | 77.2% | 82.6% | 85.2% |
| AIME 2026 | 42.5% | 77.5% | 88.3% | 89.2% |
| GPQA Diamond | 58.6% | 78.8% | 82.3% | 84.3% |
| LiveCodeBench v6 | 52.0% | 72.0% | 77.1% | 80.0% |
The pattern is clear. The 12B sits well above the 4B-class E4B and within reach of the 26B MoE, which is the trade Google is pitching: most of the bigger model’s quality, on a machine you already own.
What’s new versus Gemma 3
If you used Gemma 3, four things stand out:
- Native audio. Gemma 3 was text and vision. The 12B adds sound and video-with-audio in the base model.
- The encoder-free design. No bolt-on vision or audio encoder to load.
- 256K context. Four times the headroom for long documents, transcripts, and multi-file code.
- Apache 2.0. Earlier Gemma releases used a custom Gemma license with use restrictions. Gemma 4 moves to standard Apache 2.0, which is simpler for commercial and redistribution use.
What you can build with it
The 12B is aimed at work that runs on the device, not in the cloud:
- Offline assistants that see your screen and hear your mic without sending data out
- Meeting and call tools that transcribe, diarize, and summarize locally
- Document and media pipelines that mix PDFs, screenshots, and audio in one prompt
- Agentic workflows: it supports function calling and tool use, so it can plan and act
- Coding help at a 72.0% LiveCodeBench level, usable for local autocomplete and refactors
Because it exposes a standard chat interface through runners like Ollama and llama.cpp, you can point existing tools at it. When you wire a local model into an app, you still want to confirm the request and response shape. A tool like Apidog lets you save the local endpoint, send sample prompts, and check the JSON before you build on top of it. You can download Apidog free and aim it at the local server in a minute. More on that in the free usage guide.
License and what Apache 2.0 gives you
Gemma 4 12B is released under Apache 2.0. In plain terms:
- You can use it commercially.
- You can modify, fine-tune, and redistribute it.
- You can run it in closed-source products.
- You keep your outputs.
This is a real shift from the earlier Gemma license, which carried Google’s own use-policy terms. Apache 2.0 is the same permissive license behind a long list of open infrastructure, so legal review tends to be quick.
Hardware you need
Google’s target is a 16GB machine, VRAM or Apple-style unified memory. Quantization brings that down:
- Full quality: around 16GB
- 8-bit: roughly 14GB
- 4-bit (Q4_K_M): about 8GB, the default in Ollama
That puts the 12B within reach of a mainstream gaming GPU, a 16GB MacBook, or a mid-range workstation. The smaller E2B and E4B models go lower still if your hardware is tight.
Limitations worth knowing
Google is direct about the trade-offs in the model card:
- It can produce incorrect or outdated facts; verify anything important.
- It can reflect biases in its training data.
- It handles sarcasm, nuance, and figurative language unevenly.
- Common-sense reasoning has limits, like any model this size.
- Output quality depends on prompt clarity and the context you give it.
These are the normal caveats for a 12B open model. It won’t replace a frontier cloud model for the hardest reasoning, but that isn’t the point. The point is capable multimodal AI that runs where your data already lives.
FAQ
Is Gemma 4 12B free? Yes. The weights are open under Apache 2.0 and free to download from Hugging Face and Kaggle. You only pay for the hardware or cloud you run it on. See how to use Gemma 4 12B for free.
Can Gemma 4 12B really understand audio? Yes. It takes raw audio as input and can transcribe speech, identify speakers, and answer questions about sound. It’s the first mid-sized model to do this natively rather than through a separate speech model.
What’s the difference between gemma-4-12B and gemma-4-12B-it? The base model is pretrained only. The -it version is instruction-tuned for chat, tool use, and following directions. Most people want the -it build.
How is the 12B different from the 26B and 31B? The 12B is dense and encoder-free, tuned for 16GB machines. The 26B is a Mixture-of-Experts model (4B active, 26B total), and the 31B is a larger dense model for frontier quality. Both bigger models score higher on benchmarks but need more memory.
Does Gemma 4 12B support tool calling? Yes. It supports text and multimodal function calling, plus an optional thinking mode for step-by-step reasoning, which makes it usable for agentic workflows.
How does it compare to Gemini 3.5? Different jobs. Gemini 3.5 is Google’s hosted frontier model; see what is Gemini 3.5. Gemma 4 12B is an open model you run yourself. You trade some peak quality for privacy, offline use, and zero per-token cost.



