[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-gemma-4-assistant-models-faster-draft-tokens-zh":3,"tags-gemma-4-assistant-models-faster-draft-tokens-zh":35,"related-lang-gemma-4-assistant-models-faster-draft-tokens-zh":46,"related-posts-gemma-4-assistant-models-faster-draft-tokens-zh":50,"series-tools-fe630502-5455-4001-a6bf-0643f9eb469d":87},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":19,"translated_content":10,"views":20,"is_premium":21,"created_at":22,"updated_at":22,"cover_image":11,"published_at":23,"rewrite_status":24,"rewrite_error":10,"rewritten_from_id":25,"slug":26,"category":27,"related_article_id":28,"status":29,"google_indexed_at":30,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":31,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":21},"fe630502-5455-4001-a6bf-0643f9eb469d","Gemma 4 助手模型加速草稿 Token","\u003Cp data-speakable=\"summary\">Gemma 4 助手模型用 centroid masking，讓草稿 \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> 生成更快，vLLM 也能直接吃到這個優化。\u003C\u002Fp>\u003Cp>\u003Ca href=\"\u002Ftag\u002Fgoogle\">Google\u003C\u002Fa> 的 \u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemma\" target=\"_blank\" rel=\"noopener\">Gemma\u003C\u002Fa> 4 助手模型，這次不是在比參數量。重點是它把 speculative decoding 的草稿路徑，切得更省算力。\u003C\u002Fp>\u003Cp>原本要掃大約 26.2 萬個詞彙。現在先縮到約 4,000 個 centroid。講白了，就是少做很多 dot product。\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>項目\u003C\u002Fth>\u003Cth>數值\u003C\u002Fth>\u003Cth>意義\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>完整詞彙表\u003C\u002Ftd>\u003Ctd>約 262,000\u003C\u002Ftd>\u003Ctd>原本 lm_head 
的基礎成本\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Centroid 候選集\u003C\u002Ftd>\u003Ctd>約 4,000\u003C\u002Ftd>\u003Ctd>草稿 token 先看的小集合\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>計算量下降\u003C\u002Ftd>\u003Ctd>約 45x\u003C\u002Ftd>\u003Ctd>lm_head 少做很多工作\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>最大上下文\u003C\u002Ftd>\u003Ctd>8,192\u003C\u002Ftd>\u003Ctd>vLLM 範例的 context window\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>每步草稿 token\u003C\u002Ftd>\u003Ctd>4\u003C\u002Ftd>\u003Ctd>speculative decoding 的預測數量\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Centroid masking 到底改了什麼\u003C\u002Fh2>\u003Cp>先講白話版。一般 LLM 在選下一個 token 時，會對整個詞彙表打分。詞彙表越大，\u003Ccode>lm_head\u003C\u002Fcode> 越吃算力。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278246167-hskc.png\" alt=\"Gemma 4 助手模型加速草稿 Token\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Gemma 4 的 E2B 與 E4B assistant checkpoints，用的是 centroid masking。它不是完全不算，而是先把候選範圍縮小，再去做選擇。\u003C\u002Fp>\u003Cp>這招很適合 speculative decoding。因為草稿模型要夠快，才有意義。草稿如果自己就很慢，主模型還沒驗證，時間就先燒掉了。\u003C\u002Fp>\u003Cul>\u003Cli>完整詞彙表約 262K\u003C\u002Fli>\u003Cli>候選集縮到約 4K\u003C\u002Fli>\u003Cli>vLLM 回報 \u003Ccode>lm_head\u003C\u002Fcode> 計算量約降 45x\u003C\u002Fli>\u003Cli>有 ordered embeddings 時會自動啟用\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>為什麼跑 vLLM 的人會在意\u003C\u002Fh2>\u003Cp>對 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\" target=\"_blank\" rel=\"noopener\">vLLM\u003C\u002Fa> 使用者來說，這種優化最討喜的地方是不用手動魔改。只要 checkpoint 帶有 centroid weights，設定裡的 \u003Ccode>use_ordered_embeddings: true\u003C\u002Fcode> 就能啟動。\u003C\u002Fp>\u003Cp>這比很多 inference 技巧好用。很多方案要改 kernel、改環境變數，或是找特定 fork。說真的，工程師最煩的就是這種「看起來很美，實作很醜」的東西。\u003C\u002Fp>\u003Cp>Gemma 4 這次比較務實。你如果已經在跑 speculative decoding，這個草稿路徑優化就能直接吃到。\u003C\u002Fp>\u003Cblockquote>“Speculative decoding can 
significantly accelerate generation when the draft model is much cheaper than the target model.” — \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.17192\" target=\"_blank\" rel=\"noopener\">Yaniv Leviathan、Matan Kalman、Yossi Matias\u003C\u002Fa>\u003C\u002Fblockquote>\u003Cp>這句話很直白。草稿模型便宜，整體才會快。草稿模型如果不便宜，所有優化都只是紙上談兵。\u003C\u002Fp>\u003Ch2>數字怎麼看，和一般草稿模型差在哪\u003C\u002Fh2>\u003Cp>先看數字。262K 變 4K，差了 65.5 倍的候選範圍縮減。vLLM 的說法是，\u003Ccode>lm_head\u003C\u002Fcode> 計算量大約少 45 倍。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278256054-2cu9.png\" alt=\"Gemma 4 助手模型加速草稿 Token\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>但別直接把 45x 當成端到端速度。因為主模型還要驗證草稿 token。真正的總體加速，還要看拒絕率、batch size、GPU 型號，和上下文長度。\u003C\u002Fp>\u003Cp>不過，這個優化還是很實在。它砍掉的是 speculative decoding 裡最常見的浪費之一，也就是草稿階段的詞彙打分成本。\u003C\u002Fp>\u003Cul>\u003Cli>普通草稿模型：每步仍要掃大詞彙表\u003C\u002Fli>\u003Cli>Gemma 4 助手模型：先用 centroid 篩候選\u003C\u002Fli>\u003Cli>結果：草稿路徑更省算力\u003C\u002Fli>\u003Cli>影響：更容易把 speculative decoding 放進 production\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>跟其他推理做法比，差在哪\u003C\u002Fh2>\u003Cp>很多人一聽到加速，就想到量化、\u003Ca href=\"\u002Ftag\u002Fkv-cache\">KV cache\u003C\u002Fa>、flash attention。那些都重要，但它們處理的是不同瓶頸。\u003C\u002Fp>\u003Cp>centroid masking 處理的是草稿 token 的選擇成本。它不是把模型變小，也不是把精度硬壓下去。它是把不必要的候選先擋掉。\u003C\u002Fp>\u003Cp>這種作法比較像工程上的省錢術。不是喊口號，而是直接少算一大段。對伺服器成本敏感的團隊，這種東西很有感。\u003C\u002Fp>\u003Cul>\u003Cli>量化：主要省記憶體與部分算力\u003C\u002Fli>\u003Cli>KV cache：主要省長上下文推理成本\u003C\u002Fli>\u003Cli>speculative decoding：靠草稿模型先猜 token\u003C\u002Fli>\u003Cli>centroid masking：再把草稿模型的候選集縮小\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這個做法放在產業裡怎麼看\u003C\u002Fh2>\u003Cp>LLM 服務現在很現實。大家不只看 \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa>，也看每秒 token、每張 GPU 的吞吐量，還有延遲抖動。\u003C\u002Fp>\u003Cp>Gemma 4 這種做法，對雲端推理很有意義。因為草稿路徑越便宜，越容易把 speculative decoding 
變成預設選項，而不是實驗室裡的 demo。\u003C\u002Fp>\u003Cp>另一個重點是可部署性。這次不是叫你重寫整個 serving stack，而是讓 checkpoint 自帶 centroid weights。這點很務實。\u003C\u002Fp>\u003Cp>如果你在看\u003Ca href=\"\u002Ftag\u002F開源模型\">開源模型\u003C\u002Fa>，還是可以把它跟 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\" target=\"_blank\" rel=\"noopener\">Hugging Face\u003C\u002Fa> 上其他 checkpoint 一起比較。重點不是誰名氣大，而是誰在 production 比較省錢。\u003C\u002Fp>\u003Ch2>接下來該看什麼\u003C\u002Fh2>\u003Cp>我覺得下一步要觀察兩件事。第一，這種 ordered embeddings 會不會變成更多模型的標配。第二，vLLM 之外的 serving 框架會不會跟進。\u003C\u002Fp>\u003Cp>如果更多 assistant checkpoints 都內建 centroid weights，speculative decoding 會更容易落地。反過來說，如果只有少數模型有這個設計，它就還是特定場景的加速技巧。\u003C\u002Fp>\u003Cp>對開發者來說，現在最實際的動作很簡單。你如果在跑 Gemma 4，去確認自己是不是用了 assistant checkpoint。沒用對版本，等於白白浪費一大段草稿效率。\u003C\u002Fp>","Gemma 4 的 E2B 與 E4B 助手模型用 centroid masking，把草稿 token 的 lm_head 計算量砍掉約 45 倍，且品質損失很小。","docs.vllm.ai","https:\u002F\u002Fdocs.vllm.ai\u002Fprojects\u002Frecipes\u002Fen\u002Flatest\u002FGoogle\u002FGemma4.html",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778278246167-hskc.png",[13,14,15,16,17,18],"Gemma 4","speculative decoding","centroid masking","vLLM","LLM 推理","草稿 token","zh",2,false,"2026-05-08T22:10:33.309766+00:00","2026-05-08T22:10:33.275+00:00","done","a5d94ce6-4444-4f28-8e0c-b4fee6d43401","gemma-4-assistant-models-faster-draft-tokens-zh","tools","6dcd6852-b95a-4f62-853a-cc7eb32fff1a","published","2026-05-09T09:00:14.542+00:00",[32,33,34],"Gemma 4 助手模型用 centroid masking，把草稿 token 候選集從約 262K 縮到約 4K。","vLLM 回報這能讓 lm_head 計算量約少 45x，而且啟用方式偏自動化。","這種優化最有價值的地方，是把 speculative decoding 的草稿路徑變得更便宜。",[36,38,40,42,44],{"name":13,"slug":37},"gemma-4",{"name":14,"slug":39},"speculative-decoding",{"name":16,"slug":41},"vllm",{"name":15,"slug":43},"centroid-masking",{"name":17,"slug":45},"llm-推理",{"id":28,"slug":47,"title":48,"language":49},"gemma-4-assistant-models-faster-draft-tokens-en","Gemma 4 assistant models get faster draft 
tokens","en",[51,57,63,69,75,81],{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":27},"68e4be16-dc38-4524-a6ea-5ebe22a6c4fb","why-vidhub-huiyuan-hutong-bushi-quan-shebei-tongyong-zh","為什麼 VidHub 會員互通不是「買一次全設備通用」","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778789450987-advz.png","2026-05-14T20:10:24.048988+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":27},"7a1e174f-746b-4e82-a0e3-b2475ab39747","why-buns-zig-to-rust-experiment-is-right-zh","為什麼 Bun 的 Zig-to-Rust 實驗是對的","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778767879127-5dna.png","2026-05-14T14:10:26.886397+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":27},"e742fc73-5a65-4db3-ad17-88c99262ceb7","why-openai-api-pricing-is-product-strategy-zh","為什麼 OpenAI API 定價是產品策略，不是註腳","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778749859485-chvz.png","2026-05-14T09:10:26.003818+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":27},"c757c5d8-eda9-45dc-9020-4b002f4d6237","why-claude-code-prompt-design-beats-ide-copilots-zh","為什麼 Claude Code 的提示設計贏過 IDE Copilot","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778742645084-dao9.png","2026-05-14T07:10:29.371901+00:00",{"id":76,"slug":77,"title":78,"cover_image":79,"image_url":79,"created_at":80,"category":27},"4adef3ab-9f07-4970-91cf-77b8b581b348","why-databricks-model-serving-is-right-default-zh","為什麼 Databricks Model Serving 
是生產推論的正確預設","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778692245329-a2wt.png","2026-05-13T17:10:30.659153+00:00",{"id":82,"slug":83,"title":84,"cover_image":85,"image_url":85,"created_at":86,"category":27},"b3305057-451d-48e4-9fb9-69215f7effad","why-ibm-bob-right-kind-ai-coding-assistant-zh","為什麼 IBM 的 Bob 才是對的 AI 寫碼助手","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778664653510-64hc.png","2026-05-13T09:30:21.881547+00:00",[88,93,98,103,108,113,118,123,128,133],{"id":89,"slug":90,"title":91,"created_at":92},"de769291-4574-4c46-a76d-772bd99e6ec9","googles-biggest-gemini-launches-in-2026-zh","Google 2026 最大 Gemini 盤點","2026-03-26T07:26:39.21072+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"855cd52f-6fab-46cc-a7c1-42195e8a0de4","surepath-real-time-mcp-policy-controls-zh","SurePath 推出即時 MCP 政策控管","2026-03-26T07:57:40.77233+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"9b19ab54-edef-4dbd-9ce4-a51e4bae4ebb","mcp-in-2026-the-ai-tool-layer-teams-use-zh","2026 年 MCP：團隊真的在用的 AI 工具層","2026-03-26T08:01:46.589694+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"af9c46c3-7a28-410b-9f04-32b3de30a68c","prompting-in-2026-what-actually-works-zh","2026 提示工程，真正有用的是什麼","2026-03-26T08:08:12.453028+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"05553086-6ed0-4758-81fd-6cab24b575e0","garry-tan-open-sources-claude-code-toolkit-zh","Garry Tan 開源 Claude Code 工具包","2026-03-26T08:26:20.068737+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"042a73a2-18a2-433d-9e8f-9802b9559aac","github-ai-projects-to-watch-in-2026-zh","2026 必看 20 個 GitHub AI 專案","2026-03-26T08:28:09.619964+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"a5f94120-ac0d-4483-9a8b-63590071ac6a","claude-code-vs-cursor-2026-zh","Claude Code 與 Cursor 
深度對比：202…","2026-03-26T13:27:14.279193+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"0975afa1-e0c7-4130-a20d-d890eaed995e","practical-github-guide-learning-ml-2026-zh","2026 機器學習入門 GitHub 實用指南","2026-03-27T01:16:49.712576+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"bfdb467a-290f-4a80-b3a9-6f081afb6dff","aiml-2026-student-ai-ml-lab-repo-review-zh","AIML-2026：像課綱的學生實驗 Repo","2026-03-27T01:21:51.467798+00:00",{"id":134,"slug":135,"title":136,"created_at":137},"80cabc3e-09fc-4ff5-8f07-b8d68f5ae545","ai-trending-github-repos-and-research-feeds-zh","AI Trending：把 AI 資源收成一張表","2026-03-27T01:31:35.262183+00:00"]