OraCore Editors

Gemma 4 assistant models get faster draft tokens

Gemma 4 E2B and E4B assistant models use centroid masking to cut lm_head work by roughly 45x with little quality loss.

Gemma 4 assistant models use centroid masking to speed up draft-token generation.

Google’s Gemma 4 assistant checkpoints now ship a practical trick for speculative decoding: they shrink the candidate token set from roughly 262,000 vocabulary entries to about 4,000 centroids. That turns one huge dot product into a much smaller selection step, and the vLLM recipe reports roughly a 45x reduction in lm_head compute with little effect on draft-token quality.

Item | Value | Why it matters
Full vocabulary size | ~262K tokens | Baseline cost for the original dot product
Centroid candidate set | ~4K tokens | Much smaller pool for draft-token selection
Compute reduction | ~45x | Less work in lm_head
Example server command | vllm serve google/gemma-4-31B-it ... | Shows how to run the model with speculative decoding
Max model length | 8,192 tokens | Sets the context window in the recipe
Speculative tokens | 4 | Number of draft tokens requested per step

What centroid masking changes

The interesting part here is not the model size. It is the way the assistant model predicts tokens. In a standard setup, the model scores a very large vocabulary, then picks the next token from that distribution. Gemma 4’s E2B and E4B assistant models use centroid masking to skip most of that work and focus on a small set of candidate tokens.

This matters because speculative decoding only pays off if the draft model is cheap enough and accurate enough. If the assistant model spends too much time scoring tokens, the speedup gets eaten by overhead. If it is too approximate, the main model rejects too many draft tokens and the whole system slows down. The centroid approach tries to keep both sides in check.

  • Full vocabulary scoring: about 262,000 tokens
  • Centroid candidate set: about 4,000 tokens
  • Reported compute drop: about 45x in lm_head
  • Centroid masking activates automatically when the checkpoint includes ordered embeddings
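
To make the mechanism concrete, here is a toy NumPy sketch of the idea: score a small centroid set instead of the full vocabulary, then keep only the best-scoring clusters as draft candidates. The array sizes roughly follow the recipe's numbers, but the hidden size and the two-stage selection are illustrative assumptions, not Gemma 4's actual implementation.

```python
import numpy as np

# Toy illustration of centroid-style candidate filtering.
# Sizes loosely follow the recipe (~262K vocab, ~4K centroids);
# the hidden size and selection details are illustrative assumptions.
rng = np.random.default_rng(0)

hidden_dim, vocab_size, num_centroids = 64, 262_144, 4_096
hidden = rng.standard_normal(hidden_dim).astype(np.float32)                  # draft-model hidden state
lm_head = rng.standard_normal((vocab_size, hidden_dim)).astype(np.float32)   # full output embeddings
centroids = rng.standard_normal((num_centroids, hidden_dim)).astype(np.float32)

# Baseline: score every vocabulary entry (the expensive dot product).
full_logits = lm_head @ hidden                    # shape (262_144,)

# Centroid masking idea: score only the centroids, then restrict the
# draft-token search to tokens grouped under the top-scoring centroids.
centroid_scores = centroids @ hidden              # shape (4_096,) -- a far smaller matmul
top_clusters = np.argsort(centroid_scores)[-8:]   # candidate clusters to expand

print(full_logits.shape, centroid_scores.shape, top_clusters.shape)
```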

Why vLLM users should care

For people running vLLM, the practical win is that the optimization is automatic. The recipe says centroid masking turns on when the assistant checkpoint includes the centroid weights, via use_ordered_embeddings: true. There is no extra tuning step and no special runtime flag to hunt for.

That makes this easier to adopt than a lot of inference tricks that need custom kernels, hidden environment variables, or a matching model fork. If you already serve Gemma 4 with speculative decoding, you get a faster assistant path without changing your deployment playbook.
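
If you want to confirm that an assistant checkpoint actually ships the flag, a quick look at its config is enough. This is a minimal sketch assuming the use_ordered_embeddings key sits at the top level of the checkpoint's config.json; the repo name is the one the recipe points to.

```python
import json
from huggingface_hub import hf_hub_download

# Download just the config file and look for the flag the recipe mentions.
# Assumption: use_ordered_embeddings lives at the top level of config.json.
config_path = hf_hub_download("gg-hf-am/gemma-4-31B-it-assistant", "config.json")
with open(config_path) as f:
    config = json.load(f)

print("ordered embeddings:", config.get("use_ordered_embeddings", False))
```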

“Speculative decoding can significantly accelerate generation when the draft model is much cheaper than the target model.” — Yaniv Leviathan, Matan Kalman, and Yossi Matias

The quote above comes from the original speculative decoding paper, which explains the core tradeoff behind this recipe. Gemma 4’s centroid masking is one more way to make the draft model cheaper while keeping its guesses useful.
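
The paper also gives a simple way to reason about the payoff. Under its simplifying assumption of an i.i.d. per-token acceptance rate α, the expected number of tokens produced per verification step with γ draft tokens is (1 − α^(γ+1)) / (1 − α). The sketch below just evaluates that formula for a few acceptance rates with γ = 4, matching the recipe's setting; the numbers are illustrative, not Gemma 4 measurements.

```python
# Expected tokens per verification step from the speculative decoding paper,
# assuming an i.i.d. acceptance rate alpha and gamma draft tokens per step.
# Illustrative only; these are not Gemma 4 measurements.
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, gamma=4):.2f} tokens/step")
```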

The server command in context

The recipe uses a concrete vLLM serve example for google/gemma-4-31B-it with two tensor-parallel workers, an 8,192-token context window, and four speculative tokens per step. It also points to the assistant checkpoint gg-hf-am/gemma-4-31B-it-assistant, which is where the centroid weights live.
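
For readers who prefer the offline API, the same setup can be sketched with vLLM's Python entry point. The speculative_config keys below are assumptions that mirror the CLI flags the recipe describes, not its verbatim command.

```python
from vllm import LLM, SamplingParams

# Rough offline-API equivalent of the recipe's serve command.
# Assumption: these speculative_config keys mirror the --speculative-config flag.
llm = LLM(
    model="google/gemma-4-31B-it",
    tensor_parallel_size=2,
    max_model_len=8192,
    speculative_config={
        "model": "gg-hf-am/gemma-4-31B-it-assistant",
        "num_speculative_tokens": 4,
    },
)

outputs = llm.generate(
    ["Summarize how speculative decoding works."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```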

  • --tensor-parallel-size 2 splits the model across two workers
  • --max-model-len 8192 caps the context window at 8,192 tokens
  • --speculative-config points to the assistant checkpoint and sets num_speculative_tokens to 4
  • The assistant checkpoint must include centroid weights for automatic masking

That command tells a clear story: the optimization is meant for real serving setups, not toy benchmarks. It is tuned for operators who care about throughput, latency, and how much compute gets burned before the main model even sees the candidate tokens.

How this compares with a plain draft model

A normal assistant model still has to score a large vocabulary, so the cheap part of speculative decoding is not always that cheap. Centroid masking trims that cost by restricting the search space. The recipe’s numbers make the tradeoff easy to read: roughly 262K possible tokens become about 4K candidates, and the compute drops by about 45x.

That kind of reduction does not guarantee a 45x end-to-end speedup, because the main model still does the final verification. But it does remove one of the biggest bottlenecks in the draft path. For teams already using Hugging Face checkpoints, the appeal is obvious: better draft efficiency without a custom inference stack.

  • Plain draft path: full vocabulary scoring every step
  • Gemma 4 assistant path: centroid-based candidate filtering
  • Operational result: lower draft overhead before verification
  • Adoption path: automatic when the checkpoint ships ordered embeddings

What to watch next

The main question is whether this pattern spreads beyond Gemma 4 assistant models. If more checkpoints ship centroid weights by default, speculative decoding gets easier to justify in production. If not, this stays a useful optimization for a narrow set of deployments.

For now, the takeaway is simple: if you run Gemma 4 in vLLM and care about token throughput, check that you are using the assistant checkpoint with ordered embeddings. If you are not, you are leaving a large chunk of draft-side efficiency on the table.