Gemma 4 brings 256K context to open models

OraCore Editors

[MODEL] June 17, 20267 min readOraCore Editors

Gemma 4 brings 256K context to open models

Google’s Gemma 4 adds text, image, and audio input, plus up to 256K context and five model sizes for local or server use.

Gemma 4 multimodal AI long context Google DeepMind

Share LinkedIn

Gemma 4 brings 256K context to open models

Google’s Gemma 4 adds multimodal input, 256K context, and five open-weight model sizes.

Google DeepMind has updated Gemma with a fourth-generation model family that can read text, images, and, in some sizes, audio. The headline number is the context window: up to 256,000 tokens, which puts long-document work and multi-turn agent tasks in a much more practical range.

The release is split across five sizes, from E2B and E4B for on-device and edge deployments up to 12B, 26B A4B, and 31B for heavier workloads. Google also says the models ship as open weights in both pre-trained and instruction-tuned forms, under an Apache 2.0 license.

Model	Params	Context	Modalities	Notes
E2B	2.3B effective, 5.1B with embeddings	128K	Text, image, audio	Designed for efficient on-device use
E4B	4.5B effective, 8B with embeddings	128K	Text, image, audio	Small model with audio support
12B Unified	11.95B	256K	Text, image, audio	Decoder-only unified design
26B A4B MoE	25.2B total, 3.8B active	256K	Text, image	Mixture-of-experts model
31B	30.7B	256K	Text, image	Largest dense model in the family

Gemma 4 is built for long context and mixed inputs

Get the latest AI news in your inbox

Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.

No spam. Unsubscribe at any time.

Gemma 4 is not a single model with one deployment target. It is a family, and the split matters. The smaller E2B and E4B models are aimed at devices that need speed and lower memory use, while the 12B, 26B A4B, and 31B models are meant for stronger GPUs, workstations, and server-side inference.

That spread makes Gemma 4 more useful than a one-size-fits-all release. A mobile assistant, a desktop coding tool, and a document analysis service do not need the same tradeoff between latency, memory, and quality. Google is trying to cover all three with the same model line.

One practical detail is the context window. The smaller models use 128K tokens, and the mid and larger models go to 256K. That is enough room for long reports, large codebases, or many-turn conversations without chopping the input into tiny pieces.

E2B and E4B support text, image, and audio.
12B Unified supports text, image, and audio without separate encoders.
26B A4B and 31B focus on text and image, with 256K context.
All five sizes are open-weight releases.

The architecture choices are doing real work

Google’s documentation says Gemma 4 uses a mix of dense and mixture-of-experts designs, plus a hybrid attention scheme that alternates local sliding-window attention with global attention. That is the kind of engineering detail that usually decides whether a model feels fast in practice or just looks impressive on a chart.

The 26B A4B model is the clearest example. It has 25.2B total parameters, but only 3.8B active parameters during inference. That means the model can behave more like a smaller system at runtime while still keeping the capacity of a much larger one.

The smaller models also use per-layer embeddings, which Google says improve parameter efficiency for on-device deployment. In plain English: the model family is trying to save memory where it matters most, without stripping out the features developers actually want.

“The future of AI is open,” said Demis Hassabis, co-founder and CEO of Google DeepMind, in a 2024 blog post announcing Gemma.

That line matters here because Gemma 4 keeps Google’s open-weight story alive while adding capabilities that used to be reserved for bigger proprietary systems. The company is clearly betting that developers want models they can inspect, tune, and ship in more places.

The benchmarks show strength, but the spread is the story

Google’s benchmark table is broad, and the numbers show a family with clear tiers. The 31B model leads most of the pack, but the 26B A4B model often gets close while using far fewer active parameters. That is exactly the kind of tradeoff teams care about when they are paying for inference.

Here are a few of the more telling results from the instruction-tuned models:

MMLU Pro: 85.2% for 31B, 82.6% for 26B A4B, 77.2% for 12B Unified.
LiveCodeBench v6: 80.0% for 31B, 77.1% for 26B A4B, 72.0% for 12B Unified.
Codeforces Elo: 2150 for 31B, 1718 for 26B A4B, 1659 for 12B Unified.
MRCR v2 at 128K: 66.4% for 31B, 44.1% for 26B A4B, 43.4% for 12B Unified.

The coding and reasoning numbers are especially interesting. A 2150 Codeforces Elo is a serious result for a general-purpose model family, and the jump from 1659 to 1718 on the 26B A4B model suggests the MoE design is doing more than saving compute on paper.

There is also a visible drop-off in long-context retrieval as you move down the stack. That is normal, but it is the number to watch if you plan to stuff entire docs, transcripts, or repos into the prompt.

What developers can actually build with it

Gemma 4 is aimed at more than chat. Google highlights text generation, programming, reasoning, function calling, and multimodal understanding. The model family also adds built-in support for the system role, which makes structured prompting and agent-style workflows easier to manage.

That matters because a lot of model releases talk about “agents” without making the plumbing any better. Here, the combination of function calling, long context, and system prompt support gives teams a cleaner base for assistants that need to read, decide, and act.

If you are comparing it with other open-weight options, the practical question is deployment fit. Smaller Gemma 4 models are a better match for local apps and edge devices, while the larger ones make more sense for server-side tools that need stronger reasoning and document handling. For teams already using Hugging Face, Ollama, or LM Studio, the open-weight format lowers the friction of testing these models in real workflows.

Google also points developers to its broader ecosystem, including Google Developers Blog, Google AI, and Vertex AI. That gives Gemma 4 a straightforward path from local experiments to managed deployment.

Gemma 4 is a practical release, not a novelty drop

The most interesting thing about Gemma 4 is that Google did not optimize for a single headline metric. It built a family with real deployment variety, long context, multimodal input, and enough benchmark strength to matter in production conversations.

If you are building a document assistant, a coding copilot, or a multimodal agent, the first question is no longer whether an open model can handle the workload. The question is which Gemma 4 size fits your latency budget, memory ceiling, and context length.

My bet: the 26B A4B model will get the most attention from developers who want strong results without paying full dense-model costs, while the 12B Unified model will be the sleeper choice for teams that care about multimodal input and simpler architecture. The next thing to watch is whether third-party tooling catches up fast enough to make those choices easy to test outside Google’s own stack.

// Related Articles

Gemma 4 brings 256K context to open models

Gemma 4 is built for long context and mixed inputs

Get the latest AI news in your inbox

The architecture choices are doing real work

The benchmarks show strength, but the spread is the story

What developers can actually build with it

Gemma 4 is a practical release, not a novelty drop

Kimi K2.7 Code 该优先上 API 和 Kimi Code，而不是等生态成熟

Kingdom Hearts IV confirmed for Switch 2 launch

Gemini 3.5 Live Translate rolls out in 70+ languages

OpenAI’s 5.6 model hints at a bigger jump

GLM-5.2把前沿模型变成可用工具

OpenAI files IPO paperwork as scrutiny grows