[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-tide-cross-architecture-diffusion-llm-distillation-zh":3,"tags-tide-cross-architecture-diffusion-llm-distillation-zh":31,"related-lang-tide-cross-architecture-diffusion-llm-distillation-zh":40,"related-posts-tide-cross-architecture-diffusion-llm-distillation-zh":44,"series-research-a2761ec3-eb6a-4982-b95c-0400b46b33f5":81},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":19,"translated_content":10,"views":20,"is_premium":21,"created_at":22,"updated_at":22,"cover_image":11,"published_at":23,"rewrite_status":24,"rewrite_error":10,"rewritten_from_id":25,"slug":26,"category":27,"related_article_id":28,"status":29,"google_indexed_at":30,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":21},"a2761ec3-eb6a-4982-b95c-0400b46b33f5","TIDE 讓跨架構蒸餾可行","\u003Cp data-speakable=\"summary\">TIDE 針對 diffusion LLM 的跨架構蒸餾，加入噪聲感知與 tokenizer 感知訓練，讓小模型更能學到大模型的能力。\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.26951\">Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models\u003C\u002Fa> 這篇論文，切的不是一般常見的「把大模型縮小」問題，而是更麻煩的一種情境：老師模型和學生模型根本不是同一種架構。對 diffusion LLM 來說，這種差異會牽涉 attention 設計、tokenizer，甚至文字是怎麼被表示與對齊。作者提出 TIDE，就是要處理這個跨架構知識移轉的落差。\u003C\u002Fp>\u003Cp>這個題目很實際。diffusion large language models 本來就主打平行解碼、雙向上下文，理論上很有吸引力；但真正能撐起效果的系統，往往還是體積大、成本高。若想把能力壓到更小的模型上，蒸餾幾乎是必經之路。問題在於，過去很多方法預設學生只是老師的縮小版，架構大致對得上。只要老師和學生的內部表示開始分岔，這個假設就不太成立了。\u003C\u002Fp>\u003Ch2>這篇論文想解的痛點\u003C\u002Fh2>\u003Cp>傳統 dLLM 蒸餾，很多是在同一架構內做 inference steps 壓縮，重點是讓模型更快、更省。這類方法對單一架構很有用，但它沒有真正處理「跨架構轉移」：老師和學生在結構上不同，甚至連 tokenizer 都不一樣。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg 
src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777529462046-z8hb.png\" alt=\"TIDE 讓跨架構蒸餾可行\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這件事不是小細節。學生模型學的不只是答案，還要學老師怎麼看文字、怎麼切 token、怎麼在不同遮罩狀態下做預測。如果 token 邊界不一致，或 masking 與 decoding 行為不同，老師給出的訊號就可能變得吵雜，甚至誤導學生。也就是說，單純模仿 logits，常常不夠。\u003C\u002Fp>\u003Cp>從摘要來看，TIDE 被定位成一個專門處理 cross-architecture dLLM distillation 的框架。它要解的不是「一個模型怎麼變快」，而是「當老師和學生不是同一種模型時，知識怎麼傳得過去」。\u003C\u002Fp>\u003Ch2>TIDE 到底怎麼運作\u003C\u002Fh2>\u003Cp>TIDE 由三個模組組成，而且每個模組都對準一種常見失真來源。第一個是 TIDAL。它會根據訓練進度和 diffusion timestep 來調整蒸餾強度。白話一點，就是老師在不同 timestep 的可靠度不一樣，學生不該用同一種力道去學所有階段。\u003C\u002Fp>\u003Cp>第二個模組是 CompDemo。它透過 complementary mask splitting 來補強老師的上下文。因為 diffusion 模型在高遮罩比例下做預測時，看到的上下文太少，老師本身也可能不穩。CompDemo 的想法，是用互補式的 mask 切分，讓老師能看到更完整的上下文，減少在重度 masking 下的失真。\u003C\u002Fp>\u003Cp>第三個模組是 Reverse CALM。摘要說它是一個 cross-tokenizer objective，核心是把 chunk-level likelihood matching 反過來處理，並帶來 bounded gradients 與 dual-end noise filtering。用比較白話的方式講，這像是在老師和學生 tokenizer 不一致時，提供一種更穩定的對齊方式，避免訓練過程因為 tokenization 差異而發散。\u003C\u002Fp>\u003Cp>三個模組合起來，分別處理三件事：什麼時候該相信老師、怎麼讓老師的上下文更完整、以及怎麼跨 tokenizer 對齊輸出。這比「直接把老師輸出硬塞給學生」更貼近真實部署場景。\u003C\u002Fp>\u003Ch2>論文實際證明了什麼\u003C\u002Fh2>\u003Cp>摘要裡最明確的結果，是 TIDE 把 8B dense 和 16B MoE 的老師模型，蒸餾到一個 0.6B 學生模型，並且走了兩條 heterogeneous pipelines。作者表示，在八個 benchmark 上，蒸餾後的系統平均比 baseline 高 1.53 分。摘要沒有公開完整 benchmark 名稱與每項細節，所以這裡只能確認有八個測試與平均提升，不能補更多表格外資訊。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777529459775-yfi9.png\" alt=\"TIDE 讓跨架構蒸餾可行\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>比較醒目的單點結果出現在 code generation。HumanEval 
上，TIDE 的分數是 48.78，對照 AR baseline 的 32.3。這個落差不小，代表它不只是把平均分數往上推，也可能在開發者很在意的下游任務上真的有感。\u003C\u002Fp>\u003Cp>不過，從目前可見的資訊來看，還是要保守解讀。這份來源只有摘要，沒有完整 benchmark 細節、沒有訓練成本、沒有 wall-clock time，也沒有更廣泛的蒸餾基線比較。換句話說，我們知道它有效，但還不知道成本多高、穩定性如何、或是不是只在作者列出的 heterogeneous pipelines 才有這種效果。\u003C\u002Fp>\u003Cul>\u003Cli>老師模型：8B dense、16B MoE\u003C\u002Fli>\u003Cli>學生模型：0.6B\u003C\u002Fli>\u003Cli>八個 benchmark 平均提升：1.53 分\u003C\u002Fli>\u003Cli>HumanEval：48.78 對 32.3（AR baseline）\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>對開發者有什麼影響\u003C\u002Fh2>\u003Cp>如果你在做模型部署，這篇最有意思的地方，不是蒸餾本身，而是它承認了現實中的 heterogeneity。實務上，團隊常常沒辦法挑到「老師和學生完全同款架構」的組合。你可能想壓縮的是不同家族的模型，甚至是 tokenizer、attention pattern 都不一樣的系統。\u003C\u002Fp>\u003Cp>TIDE 提示了一個很重要的方向：跨架構壓縮，可能不能只看輸出對不對，還要把蒸餾目標做成 representation-aware。也就是說，當老師和學生對文字的內部表示不一致時，蒸餾方法本身就得跟著調整。這篇的三個模組，正好對應了這個思路：噪聲感知、masking 下的上下文補強、以及 tokenizer 感知的對齊。\u003C\u002Fp>\u003Cp>這對實作端的啟發很直接。若你在評估蒸餾方案，除了看最終分數，也要問：老師和學生是不是同一種 tokenizer？注意力結構差多少？在不同 timestep 下，老師的訊號是否一樣可靠？如果答案都是否定的，那就不能期待傳統蒸餾 objective 自動幫你解決。\u003C\u002Fp>\u003Cp>同時，限制也很明顯。這篇目前能驗證的內容，仍然只來自 arXiv 摘要。它告訴我們方法存在、三個模組是什麼、以及 headline result 是多少，但沒有提供足夠資訊去判斷泛化能力、超參數敏感度、或額外複雜度是否值得。\u003C\u002Fp>\u003Cp>即便如此，TIDE 仍然是一個重要訊號。diffusion LLM 的研究，正在從「能不能蒸餾」走向「當老師和學生根本不是同一種語言時，還能不能蒸餾」。這篇的答案是可以，但前提是蒸餾過程要懂噪聲、懂 masking，也要懂 tokenizer 差異。\u003C\u002Fp>\u003Cp>對台灣開發者來說，這類工作最值得注意的，不只是分數提升，而是它把蒸餾問題從單純壓縮，推進到跨架構協作。未來如果要把大型 diffusion LLM 落地到更小的部署環境，這種「讓學生學會跟老師不同步的表示方式」的設計，可能會比單純縮參數更關鍵。\u003C\u002Fp>\u003Ch2>這篇可以怎麼看\u003C\u002Fh2>\u003Cp>如果只用一句話總結，TIDE 是在解一個很多蒸餾方法沒正面碰的問題：老師和學生不一樣時，怎麼讓知識真的傳下去。它不是把既有蒸餾再微調一下，而是把 timestep、mask、tokenizer 這三個容易出問題的地方都納入設計。\u003C\u002Fp>\u003Cp>而就目前摘要能證實的範圍來看，它至少在 0.6B 學生上做出了可量化的提升，也在 HumanEval 這種開發者熟悉的任務上交出明顯差距。剩下的問題，就要等完整論文看更多 ablation、更多成本資訊，才能判斷這套方法到底是研究上漂亮，還是實務上也夠划算。\u003C\u002Fp>","TIDE 針對 diffusion LLM 的跨架構蒸餾，加入噪聲感知權重與 tokenizer 感知目標，讓 0.6B 
學生模型更接近大模型表現。","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.26951",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777529462046-z8hb.png",[13,14,15,16,17,18],"diffusion LLM","knowledge distillation","cross-architecture","tokenizer","MoE","HumanEval","zh",0,false,"2026-04-30T06:10:31.730141+00:00","2026-04-30T06:10:31.112+00:00","done","bd68ad3d-49aa-4e83-bbb9-ff56861d5393","tide-cross-architecture-diffusion-llm-distillation-zh","research","2061a3d3-9d89-4722-ac8b-e359941b4573","published","2026-04-30T09:00:07.673+00:00",[32,33,34,36,38],{"name":15,"slug":15},{"name":16,"slug":16},{"name":17,"slug":35},"moe",{"name":14,"slug":37},"knowledge-distillation",{"name":13,"slug":39},"diffusion-llm",{"id":28,"slug":41,"title":42,"language":43},"tide-cross-architecture-diffusion-llm-distillation-en","TIDE distills diffusion LLMs across architectures","en",[45,51,57,63,69,75],{"id":46,"slug":47,"title":48,"cover_image":49,"image_url":49,"created_at":50,"category":27},"667b72b6-e821-4d68-80a1-e03340bc85f1","turboquant-seo-shift-small-sites-zh","TurboQuant 與小站 SEO 變化","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840440690-kcw9.png","2026-05-15T10:20:27.319472+00:00",{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":27},"381fb6c6-6da7-4444-831f-8c5eed8d685c","turboquant-vllm-comparison-fp8-kv-cache-zh","TurboQuant 與 FP8 實測結果","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839867551-4v9g.png","2026-05-15T10:10:36.034569+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":27},"c15f45ee-a548-4dbf-8152-91de159c1a11","llmbda-calculus-agent-safety-rules-zh","LLMbda 演算替 AI 
代理人立安全規則","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825503412-mlbf.png","2026-05-15T06:10:34.832664+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":27},"0c02225c-d6ff-44f8-bc92-884c8921c4a3","low-complexity-beamspace-denoiser-mmwave-mimo-zh","更簡單的毫米波波束域去噪器","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814650361-xtc2.png","2026-05-15T03:10:30.06639+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":27},"9d27f967-62cc-433f-8cdb-9300937ade13","ai-benchmark-wins-cyber-scare-defenders-zh","為什麼 AI 基準賽在資安領域的勝利，應該讓防守方警醒","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807450006-nofx.png","2026-05-15T01:10:29.379041+00:00",{"id":76,"slug":77,"title":78,"cover_image":79,"image_url":79,"created_at":80,"category":27},"bc402dc6-5da6-46fc-9d66-d09cb215f72b","why-linux-security-needs-patch-wave-mindset-zh","為什麼 Linux 安全需要「補丁浪潮」思維","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741449813-s2wn.png","2026-05-14T06:50:24.052583+00:00",[82,87,92,97,102,107,112,117,122,127],{"id":83,"slug":84,"title":85,"created_at":86},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 AI","2026-03-26T08:16:02.367355+00:00",{"id":88,"slug":89,"title":90,"created_at":91},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 
研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"9f50561b-aebd-46ba-94a8-363198aa7091","openclaw-agents-manipulated-self-sabotage-zh","OpenClaw Agent 會自己搞砸自己","2026-03-28T03:03:18.786425+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"11f22e92-7066-4978-a544-31f5f2156ec6","vega-learning-to-drive-with-natural-language-instructions-zh","Vega：使用自然語言指示進行自駕車控制","2026-03-28T14:54:04.847912+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"a4c7cfec-8d0e-4fec-93cf-1b9699a530b8","drive-my-way-en-zh","Drive My Way：個性化自駕車風格的實現","2026-03-28T14:54:26.207495+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"dec02f89-fd39-41ba-8e4d-11ede93a536d","training-knowledge-bases-with-writeback-rag-zh","用 WriteBack-RAG 強化知識庫提升檢索效能","2026-03-28T14:54:45.775606+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"3886be5c-a137-40cc-b9e2-0bf18430c002","packforcing-efficient-long-video-generation-method-zh","PackForcing：短影片訓練也能生成長影片","2026-03-28T14:55:02.688141+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"72b90667-d930-4cc9-8ced-aaa0f8968d44","pixelsmile-toward-fine-grained-facial-expression-editing-zh","PixelSmile：提升精細臉部表情編輯的新方法","2026-03-28T14:55:20.678181+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00"]