[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-gemma-4-assistant-models-faster-draft-tokens-en":3,"tags-gemma-4-assistant-models-faster-draft-tokens-en":34,"related-lang-gemma-4-assistant-models-faster-draft-tokens-en":45,"related-posts-gemma-4-assistant-models-faster-draft-tokens-en":49,"series-tools-6dcd6852-b95a-4f62-853a-cc7eb32fff1a":86},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":30,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"6dcd6852-b95a-4f62-853a-cc7eb32fff1a","Gemma 4 assistant models get faster draft tokens","\u003Cp data-speakable=\"summary\">Gemma 4 assistant models use centroid masking to speed up draft-\u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> generation.\u003C\u002Fp>\u003Cp>\u003Ca href=\"\u002Ftag\u002Fgoogle\">Google\u003C\u002Fa>’s \u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemma\" target=\"_blank\" rel=\"noopener\">Gemma\u003C\u002Fa> 4 assistant checkpoints now have a practical trick for speculative decoding: they shrink the candidate token set from roughly 262,000 vocabulary entries to about 4,000 centroids. 
That turns a huge dot product into a much smaller selection step, and \u003Ca href=\"https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002F\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa> says the result is about a 45x reduction in \u003Ccode>lm_head\u003C\u002Fcode> compute with little effect on draft token quality.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Item\u003C\u002Fth>\u003Cth>Value\u003C\u002Fth>\u003Cth>Why it matters\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Full vocabulary size\u003C\u002Ftd>\u003Ctd>~262K tokens\u003C\u002Ftd>\u003Ctd>Baseline cost for the original dot product\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Centroid candidate set\u003C\u002Ftd>\u003Ctd>~4K tokens\u003C\u002Ftd>\u003Ctd>Much smaller pool for draft-token selection\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Compute reduction\u003C\u002Ftd>\u003Ctd>~45x\u003C\u002Ftd>\u003Ctd>Less work in \u003Ccode>lm_head\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Example server command\u003C\u002Ftd>\u003Ctd>\u003Ccode>vllm serve google\u002Fgemma-4-31B-it ...\u003C\u002Fcode>\u003C\u002Ftd>\u003Ctd>Shows how to run the model with speculative decoding\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Max model length\u003C\u002Ftd>\u003Ctd>8192\u003C\u002Ftd>\u003Ctd>Sets the context window in the recipe\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Speculative tokens\u003C\u002Ftd>\u003Ctd>4\u003C\u002Ftd>\u003Ctd>Number of draft tokens requested per step\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>What centroid masking changes\u003C\u002Fh2>\u003Cp>The interesting part here is not the model size. It is the way the assistant model predicts tokens. In a standard setup, the model scores a very large vocabulary, then picks the next token from that distribution. 
Gemma 4’s E2B and E4B assistant models use centroid masking to skip most of that work and focus on a small set of candidate tokens.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278254841-r19z.png\" alt=\"Gemma 4 assistant models get faster draft tokens\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>This matters because speculative decoding only pays off if the draft model is cheap enough and accurate enough. If the assistant model spends too much time scoring tokens, the speedup gets eaten by overhead. If it is too approximate, the main model rejects too many draft tokens and the whole system slows down. The centroid approach tries to keep both sides in check.\u003C\u002Fp>\u003Cul>\u003Cli>Full vocabulary scoring: about 262,000 tokens\u003C\u002Fli>\u003Cli>Centroid candidate set: about 4,000 tokens\u003C\u002Fli>\u003Cli>Reported compute drop: about 45x in \u003Ccode>lm_head\u003C\u002Fcode>\u003C\u002Fli>\u003Cli>Centroid masking activates automatically when the checkpoint includes ordered embeddings\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Why vLLM users should care\u003C\u002Fh2>\u003Cp>For people running \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa>, the practical win is that the optimization is automatic. The recipe says centroid masking turns on when the assistant checkpoint includes the centroid weights, via \u003Ccode>use_ordered_embeddings: true\u003C\u002Fcode>. There is no extra tuning step and no special runtime flag to hunt for.\u003C\u002Fp>\u003Cp>That makes this easier to adopt than a lot of inference tricks that need custom kernels, hidden environment variables, or a matching model fork. 
If you already serve Gemma 4 with speculative decoding, you get a faster assistant path without changing your deployment playbook.\u003C\u002Fp>\u003Cblockquote>“Speculative decoding can significantly accelerate generation when the draft model is much cheaper than the target model.” — \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.17192\" target=\"_blank\" rel=\"noopener\">Yaniv Leviathan, Matan Kalman, and Yossi Matias\u003C\u002Fa>\u003C\u002Fblockquote>\u003Cp>The quote above comes from the original speculative decoding paper, which explains the core tradeoff behind this recipe. Gemma 4’s centroid masking is one more way to make the draft model cheaper while keeping its guesses useful.\u003C\u002Fp>\u003Ch2>The server command in context\u003C\u002Fh2>\u003Cp>The recipe uses a concrete \u003Ca href=\"https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fserving\u002Fengine_args.html\" target=\"_blank\" rel=\"noopener\">vLLM serve\u003C\u002Fa> example for \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-4-31B-it\" target=\"_blank\" rel=\"noopener\">google\u002Fgemma-4-31B-it\u003C\u002Fa> with two tensor-parallel workers, an 8,192-token context window, and four speculative tokens per step. 
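\u003C\u002Fp>\u003Cp>Assembled from those values, the full invocation looks roughly like this (a sketch pieced together from the recipe’s stated flags and checkpoint names; the exact \u003Ccode>--speculative-config\u003C\u002Fcode> keys can vary between vLLM versions, so check your version’s engine-argument docs):\u003C\u002Fp>\u003Cpre>\u003Ccode>vllm serve google\u002Fgemma-4-31B-it \\\n  --tensor-parallel-size 2 \\\n  --max-model-len 8192 \\\n  --speculative-config '{\"model\": \"gg-hf-am\u002Fgemma-4-31B-it-assistant\", \"num_speculative_tokens\": 4}'\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>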
It also points to the assistant checkpoint \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fgg-hf-am\u002Fgemma-4-31B-it-assistant\" target=\"_blank\" rel=\"noopener\">gg-hf-am\u002Fgemma-4-31B-it-assistant\u003C\u002Fa>, which is where the centroid weights live.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278260075-gc0i.png\" alt=\"Gemma 4 assistant models get faster draft tokens\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cul>\u003Cli>\u003Ccode>--tensor-parallel-size 2\u003C\u002Fcode> splits the model across two workers\u003C\u002Fli>\u003Cli>\u003Ccode>--max-model-len 8192\u003C\u002Fcode> caps the context window at 8,192 tokens\u003C\u002Fli>\u003Cli>\u003Ccode>--speculative-config\u003C\u002Fcode> points to the assistant checkpoint and sets \u003Ccode>num_speculative_tokens\u003C\u002Fcode> to 4\u003C\u002Fli>\u003Cli>The assistant checkpoint must include centroid weights for automatic masking\u003C\u002Fli>\u003C\u002Ful>\u003Cp>That command tells a clear story: the optimization is meant for real serving setups, not toy benchmarks. It is tuned for operators who care about throughput, latency, and how much compute gets burned before the main model even sees the candidate tokens.\u003C\u002Fp>\u003Ch2>How this compares with a plain draft model\u003C\u002Fh2>\u003Cp>A normal assistant model still has to score a large vocabulary, so the cheap part of speculative decoding is not always that cheap. Centroid masking trims that cost by restricting the search space. The recipe’s numbers make the tradeoff easy to read: roughly 262K possible tokens become about 4K candidates, and the compute drops by about 45x.\u003C\u002Fp>\u003Cp>That kind of reduction does not guarantee a 45x end-to-end speedup, because the main model still does the final verification. 
But it does remove one of the biggest bottlenecks in the draft path. For teams already using \u003Ca href=\"https:\u002F\u002Fhuggingface.co\" target=\"_blank\" rel=\"noopener\">Hugging Face\u003C\u002Fa> checkpoints, the appeal is obvious: better draft efficiency without a custom inference stack.\u003C\u002Fp>\u003Cul>\u003Cli>Plain draft path: full vocabulary scoring every step\u003C\u002Fli>\u003Cli>Gemma 4 assistant path: centroid-based candidate filtering\u003C\u002Fli>\u003Cli>Operational result: lower draft overhead before verification\u003C\u002Fli>\u003Cli>Adoption path: automatic when the checkpoint ships ordered embeddings\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What to watch next\u003C\u002Fh2>\u003Cp>The main question is whether this pattern spreads beyond Gemma 4 assistant models. If more checkpoints ship centroid weights by default, speculative decoding gets easier to justify in production. If not, this stays a useful optimization for a narrow set of deployments.\u003C\u002Fp>\u003Cp>For now, the takeaway is simple: if you run Gemma 4 in vLLM and care about token throughput, check that you are using the assistant checkpoint with ordered embeddings. 
If you are not, you are leaving a large chunk of draft-side efficiency on the table.\u003C\u002Fp>","Gemma 4 E2B and E4B assistant models use centroid masking to cut lm_head work about 45x with little quality loss.","docs.vllm.ai","https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Frecipes\u002Fen\u002Flatest\u002FGoogle\u002FGemma4.html",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278254841-r19z.png",[13,14,15,16,17],"Gemma 4","vLLM","speculative decoding","centroid masking","assistant model","en",4,false,"2026-05-08T22:10:34.02358+00:00","2026-05-08T22:10:34.007+00:00","done","a5d94ce6-4444-4f28-8e0c-b4fee6d43401","gemma-4-assistant-models-faster-draft-tokens-en","tools","fe630502-5455-4001-a6bf-0643f9eb469d","published","2026-05-09T09:00:14.513+00:00",[31,32,33],"Centroid masking cuts Gemma 4 assistant draft computation by about 45x.","The optimization is automatic when the checkpoint includes ordered embeddings.","The recipe uses vLLM with 2-way tensor parallelism, an 8,192-token context, and 4 speculative tokens.",[35,37,39,41,43],{"name":13,"slug":36},"gemma-4",{"name":15,"slug":38},"speculative-decoding",{"name":14,"slug":40},"vllm",{"name":17,"slug":42},"assistant-model",{"name":16,"slug":44},"centroid-masking",{"id":27,"slug":46,"title":47,"language":48},"gemma-4-assistant-models-faster-draft-tokens-zh","Gemma 4 助手模型加速草稿 Token","zh",[50,56,62,68,74,80],{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":26},"8b02abfa-eb16-4853-8b15-63d302c7b587","why-vidhub-huiyuan-hutong-bushi-quan-shebei-tongyong-en","Why VidHub 
membership sharing isn’t “buy once, use on every device”","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778789439875-uceq.png","2026-05-14T20:10:26.046635+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":26},"abe54a57-7461-4659-b2a0-99918dfd2a33","why-buns-zig-to-rust-experiment-is-right-en","Why Bun’s Zig-to-Rust experiment is the right move","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778767895201-5745.png","2026-05-14T14:10:29.298057+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":26},"f0015918-251b-43d7-95af-032d2139f3f6","why-openai-api-pricing-is-product-strategy-en","Why OpenAI API pricing is a product strategy, not a footnote","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778749841805-uyhg.png","2026-05-14T09:10:27.921211+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":26},"7096dab0-6d27-42d9-b951-7545a5dddf33","why-claude-code-prompt-design-beats-ide-copilots-en","Why Claude Code’s prompt design beats IDE copilots","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778742651754-3kxk.png","2026-05-14T07:10:30.953808+00:00",{"id":75,"slug":76,"title":77,"cover_image":78,"image_url":78,"created_at":79,"category":26},"1f1bff1e-0ebc-4fa7-a078-64dc4b552548","why-databricks-model-serving-is-right-default-en","Why Databricks Model Serving is the right default for production 
infe…","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778692290314-gopj.png","2026-05-13T17:10:32.167576+00:00",{"id":81,"slug":82,"title":83,"cover_image":84,"image_url":84,"created_at":85,"category":26},"029add1b-4386-4970-bd37-45809d6f7f2f","why-ibm-bob-right-kind-ai-coding-assistant-en","Why IBM’s Bob is the right kind of AI coding assistant","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778664645900-cyz4.png","2026-05-13T09:30:22.413196+00:00",[87,92,97,102,107,112,117,122,127,132],{"id":88,"slug":89,"title":90,"created_at":91},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for Developers","2026-03-26T01:29:07.835148+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"d6653030-ee6d-4043-898d-d2de0388545b","evolving-world-prompt-engineering-en","The Evolving World of Prompt Engineering","2026-03-26T01:29:42.061205+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code 
Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 2026","2026-03-26T08:28:09.752027+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"231306b3-1594-45b2-af81-bb80e41182f2","claude-code-vs-cursor-2026-en","Claude Code vs Cursor in 2026","2026-03-26T13:27:14.177468+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":133,"slug":134,"title":135,"created_at":136},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00"]