OraCore Editors

Gemma 4 assistant models get faster draft tokens

Gemma 4 E2B and E4B assistant models use centroid masking to cut lm_head work by roughly 45x with little quality loss.

Gemma 4 assistant models use centroid masking to speed up draft-token generation.

Google’s Gemma 4 assistant checkpoints now ship a practical trick for speculative decoding: they shrink the candidate token set from roughly 262,000 vocabulary entries to about 4,000 centroids. That turns one huge dot product into a much smaller selection step, and the vLLM recipe reports roughly a 45x reduction in lm_head compute with little effect on draft-token quality.

Item | Value | Why it matters
Full vocabulary size | ~262K tokens | Baseline cost for the original dot product
Centroid candidate set | ~4K tokens | Much smaller pool for draft-token selection
Compute reduction | ~45x | Less work in lm_head
Example server command | vllm serve google/gemma-4-31B-it ... | Shows how to run the model with speculative decoding
Max model length | 8,192 tokens | Sets the context window in the recipe
Speculative tokens | 4 | Number of draft tokens requested per step

What centroid masking changes

The interesting part here is not the model size. It is the way the assistant model predicts tokens. In a standard setup, the model scores a very large vocabulary, then picks the next token from that distribution. Gemma 4’s E2B and E4B assistant models use centroid masking to skip most of that work and focus on a small set of candidate tokens.

This matters because speculative decoding only pays off if the draft model is cheap enough and accurate enough. If the assistant model spends too much time scoring tokens, the speedup gets eaten by overhead. If it is too approximate, the main model rejects too many draft tokens and the whole system slows down. The centroid approach tries to keep both sides in check.

  • Full vocabulary scoring: about 262,000 tokens
  • Centroid candidate set: about 4,000 tokens
  • Reported compute drop: about 45x in lm_head
  • Centroid masking activates automatically when the checkpoint includes ordered embeddings
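
To make the mechanism concrete, here is a toy NumPy sketch of the idea: score a small centroid set instead of the full vocabulary, then keep only the best-scoring clusters as draft candidates. The array sizes roughly follow the recipe's numbers, but the hidden size and the two-stage selection are illustrative assumptions, not Gemma 4's actual implementation.

```python
import numpy as np

# Toy illustration of centroid-style candidate filtering.
# Sizes loosely follow the recipe (~262K vocab, ~4K centroids);
# the hidden size and selection details are illustrative assumptions.
rng = np.random.default_rng(0)

hidden_dim, vocab_size, num_centroids = 64, 262_144, 4_096
hidden = rng.standard_normal(hidden_dim).astype(np.float32)                  # draft-model hidden state
lm_head = rng.standard_normal((vocab_size, hidden_dim)).astype(np.float32)   # full output embeddings
centroids = rng.standard_normal((num_centroids, hidden_dim)).astype(np.float32)

# Baseline: score every vocabulary entry (the expensive dot product).
full_logits = lm_head @ hidden                    # shape (262_144,)

# Centroid masking idea: score only the centroids, then restrict the
# draft-token search to tokens grouped under the top-scoring centroids.
centroid_scores = centroids @ hidden              # shape (4_096,) -- a far smaller matmul
top_clusters = np.argsort(centroid_scores)[-8:]   # candidate clusters to expand

print(full_logits.shape, centroid_scores.shape, top_clusters.shape)
```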

Why vLLM users should care

For people running vLLM, the practical win is that the optimization is automatic. The recipe says centroid masking turns on when the assistant checkpoint includes the centroid weights, via use_ordered_embeddings: true. There is no extra tuning step and no special runtime flag to hunt for.

That makes this easier to adopt than a lot of inference tricks that need custom kernels, hidden environment variables, or a matching model fork. If you already serve Gemma 4 with speculative decoding, you get a faster assistant path without changing your deployment playbook.
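
If you want to confirm that an assistant checkpoint actually ships the flag, a quick look at its config is enough. This is a minimal sketch assuming the use_ordered_embeddings key sits at the top level of the checkpoint's config.json; the repo name is the one the recipe points to.

```python
import json
from huggingface_hub import hf_hub_download

# Download just the config file and look for the flag the recipe mentions.
# Assumption: use_ordered_embeddings lives at the top level of config.json.
config_path = hf_hub_download("gg-hf-am/gemma-4-31B-it-assistant", "config.json")
with open(config_path) as f:
    config = json.load(f)

print("ordered embeddings:", config.get("use_ordered_embeddings", False))
```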

“Speculative decoding can significantly accelerate generation when the draft model is much cheaper than the target model.” — Yaniv Leviathan, Matan Kalman, and Yossi Matias

The quote above comes from the original speculative decoding paper, which explains the core tradeoff behind this recipe. Gemma 4’s centroid masking is one more way to make the draft model cheaper while keeping its guesses useful.
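
The paper also gives a simple way to reason about the payoff. Under its simplifying assumption of an i.i.d. per-token acceptance rate α, the expected number of tokens produced per verification step with γ draft tokens is (1 − α^(γ+1)) / (1 − α). The sketch below just evaluates that formula for a few acceptance rates with γ = 4, matching the recipe's setting; the numbers are illustrative, not Gemma 4 measurements.

```python
# Expected tokens per verification step from the speculative decoding paper,
# assuming an i.i.d. acceptance rate alpha and gamma draft tokens per step.
# Illustrative only; these are not Gemma 4 measurements.
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, gamma=4):.2f} tokens/step")
```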

The server command in context

The recipe uses a concrete vLLM serve example for google/gemma-4-31B-it with two tensor-parallel workers, an 8,192-token context window, and four speculative tokens per step. It also points to the assistant checkpoint gg-hf-am/gemma-4-31B-it-assistant, which is where the centroid weights live.
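
For readers who prefer the offline API, the same setup can be sketched with vLLM's Python entry point. The speculative_config keys below are assumptions that mirror the CLI flags the recipe describes, not its verbatim command.

```python
from vllm import LLM, SamplingParams

# Rough offline-API equivalent of the recipe's serve command.
# Assumption: these speculative_config keys mirror the --speculative-config flag.
llm = LLM(
    model="google/gemma-4-31B-it",
    tensor_parallel_size=2,
    max_model_len=8192,
    speculative_config={
        "model": "gg-hf-am/gemma-4-31B-it-assistant",
        "num_speculative_tokens": 4,
    },
)

outputs = llm.generate(
    ["Summarize how speculative decoding works."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```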

  • --tensor-parallel-size 2 splits the model across two workers
  • --max-model-len 8192 caps the context window at 8,192 tokens
  • --speculative-config points to the assistant checkpoint and sets num_speculative_tokens to 4
  • The assistant checkpoint must include centroid weights for automatic masking

That command tells a clear story: the optimization is meant for real serving setups, not toy benchmarks. It is tuned for operators who care about throughput, latency, and how much compute gets burned before the main model even sees the candidate tokens.

How this compares with a plain draft model

A normal assistant model still has to score a large vocabulary, so the cheap part of speculative decoding is not always that cheap. Centroid masking trims that cost by restricting the search space. The recipe’s numbers make the tradeoff easy to read: roughly 262K possible tokens become about 4K candidates, and the compute drops by about 45x.

That kind of reduction does not guarantee a 45x end-to-end speedup, because the main model still does the final verification. But it does remove one of the biggest bottlenecks in the draft path. For teams already using Hugging Face checkpoints, the appeal is obvious: better draft efficiency without a custom inference stack.

  • Plain draft path: full vocabulary scoring every step
  • Gemma 4 assistant path: centroid-based candidate filtering
  • Operational result: lower draft overhead before verification
  • Adoption path: automatic when the checkpoint ships ordered embeddings

What to watch next

The main question is whether this pattern spreads beyond Gemma 4 assistant models. If more checkpoints ship centroid weights by default, speculative decoding gets easier to justify in production. If not, this stays a useful optimization for a narrow set of deployments.

For now, the takeaway is simple: if you run Gemma 4 in vLLM and care about token throughput, check that you are using the assistant checkpoint with ordered embeddings. If you are not, you are leaving a large chunk of draft-side efficiency on the table.