<h1>Microsoft's GoalCover Finds Fine-Tuning Gaps</h1>
<p data-speakable="summary"><a href="/tag/microsoft">Microsoft</a> Research's GoalCover identifies data gaps before fine-tuning, letting teams see which capabilities a model still lacks.</p>
<p>Put plainly, many fine-tuning datasets look complete but are actually missing a few key sub-skills. This Microsoft Research paper, published in April 2026, also ran reinforcement fine-tuning for financial summarization on <a href="https://huggingface.co/Qwen/Qwen3-14B" target="_blank" rel="noopener">Qwen-3-14B</a>, and genuinely pushed the reward up.</p>
<table><thead><tr><th>Metric</th><th>Result</th><th>Meaning</th></tr></thead><tbody><tr><td>Target subgoal degradation</td><td>25.6%</td><td>Average drop of target sub-skills in corruption tests</td></tr><tr><td>Non-target subgoal degradation</td><td>2.1%</td><td>Average drop of non-target sub-skills</td></tr><tr><td>Cohen's d</td><td>1.24</td><td>Clear separation between the two groups</td></tr><tr><td>LLM-judge reward</td><td>3.77 → 4.12</td><td>Unfiltered data vs. GoalCover-filtered data</td></tr><tr><td>Best reward</td><td>4.20</td><td>Filtered data plus synthetic samples</td></tr></tbody></table>
<h2>What problem GoalCover is actually solving</h2>
<p>Anyone who has trained a domain model knows this pain. The dataset looks big enough, yet after launch the model keeps missing a few key behaviors. Bluntly, the model is not too small; the data never taught those behaviors.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1778462447499-4gq4.png" alt="Microsoft GoalCover finds fine-tuning gaps" class="rounded-xl w-full" loading="lazy" /></figure>
<p>Microsoft Research's point is direct: before burning compute, check what the data is actually missing. GoalCover is the tool for that. It decomposes a broad goal into multiple subgoals, then uses <a href="/news/why-small-language-models-should-replace-llm-first-enterpris-zh">LLM</a>-based alignment to measure how well each data point covers each subgoal.</p>
<p>This approach is more like a health check than training. Instead of waiting for the model to fail an exam, you patch the holes in the data before training. That strikes me as very practical, because fixing data is usually far cheaper than rerunning a training job.</p>
<ul><li>Decompose the big task into checkable subgoals.</li><li>Align every data point against each subgoal.</li><li>Find the low-scoring regions to locate the gaps.</li><li>Decide whether to collect more data, filter, or generate synthetic samples.</li></ul>
<h2>Why this set of evaluation numbers is interesting</h2>
<p>The paper is not just a concept. The team ran two kinds of validation: a controlled corruption test, and an actual downstream fine-tuning task. The former checks the methodology; the latter checks whether it really helps a model.</p>
<p>In the corruption test, GoalCover separated target and non-target sub-skills cleanly. Target sub-skills dropped 25.6% on average; non-target sub-skills dropped only 2.1%. That gap is substantial. It suggests the framework is picking up genuine capability gaps, not random noise.</p>
<blockquote>"We introduce GoalCover, a framework that helps practitioners systematically detect capability gaps in fine-tuning datasets through interactive goal decomposition and automated coverage assessment."</blockquote>
<p>The quote is blunt: this is not a new training method but a pre-training diagnostic. That positioning matters, because many teams blame the model architecture when the real problem is data coverage.</p>
<p>For companies, this kind of diagnosis saves money. One avoided failed fine-tuning run is one fewer <a href="/tag/gpu">GPU</a> burn, and a few fewer days spent debugging a model that could never have learned the skill from that data.</p>
<h2>How to read the Qwen-3-14B result</h2>
<p>What really earns a nod is the financial summarization result. Data filtered by GoalCover lifted the <a href="/tag/llm">LLM</a>-judge reward from 3.77 to 4.12. Adding goal-conditioned synthetic samples pushed it to 4.20 at best. That is no miracle, but it is very useful.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1778462449208-ku62.png" alt="Microsoft GoalCover finds fine-tuning gaps" class="rounded-xl w-full" loading="lazy" /></figure>
<p>The point of these numbers is not how high the absolute scores are, but that the improvement came from data selection. No bigger model, no extra training epochs. It demonstrates one thing: get data coverage right, and the score actually moves.</p>
<ul><li>Unfiltered baseline: 3.77</li><li>After GoalCover filtering: 4.12</li><li>Plus synthetic samples: 4.20</li><li>Test task: financial summarization RFT</li></ul>
<p>This also makes GoalCover more interesting than a generic data filter. Generic tools only tell you which data is bad. GoalCover also tells you why it is bad, and which subgoal is missing.</p>
<p>That matters for data teams, because you are not just deleting data; you are filling in capabilities. Those are very different things: the former is taking out the trash, the latter is fixing the curriculum.</p>
<h2>How it differs from a typical fine-tuning pipeline</h2>
<p>The traditional pipeline is: collect data, annotate, train, and only then use validation to see what went wrong. Nothing wrong with that, but it is slow. Many blind spots only surface after the model has finished training.</p>
<p>GoalCover moves part of that work forward. It checks coverage first, then decides whether to train. That is especially valuable for high-stakes tasks like medical QA, legal summarization, and financial summarization. In these settings the fear is not a model that talks well, but one that skips a key step.</p>
<p>If you build LLM workflows, think of it as a pre-training data dashboard. With <a href="https://www.microsoft.com/en-us/research/" target="_blank" rel="noopener">Microsoft Research</a>, <a href="https://huggingface.co" target="_blank" rel="noopener">Hugging Face</a>, and <a href="/tag/開源模型">open-source model</a> families like <a href="/tag/qwen">Qwen</a>, the pipeline can look like this:</p>
<ul><li>Decompose the task into atomic subgoals.</li><li>Score the dataset against each subgoal.</li><li>Patch the weak spots first.</li><li>Confirm coverage, then start training.</li></ul>
<p>That beats dumping one big pile of data into training. At least you know what you are teaching, and what the model has not yet learned.</p>
<h2>What this means for fine-tuning teams</h2>
<p>To me, the most practical part of this paper is that it pulls the fine-tuning problem back to the data. When results disappoint, many teams reach for more parameters, a different model, or a new loss. Often the dataset is simply missing a piece.</p>
<p>GoalCover's value is turning capability coverage into a checkable metric. You do not have to wait for the model to fail in production; you can see the risk before training, which is far cleaner than patching afterward.</p>
<p>If Microsoft Research pushes this method to more tasks, the next question is cost. Whether it holds up across more data types, more annotation styles, and more model families is the real test. My take is simple: teams building domain LLMs should add capability coverage to their checklist. Skip that question, and the whole run is often wasted.</p>
<p>Source: <a href="https://www.microsoft.com/en-us/research/publication/diagnosing-capability-gaps-in-fine-tuning-data/" target="_blank" rel="noopener">Microsoft Research publication</a></p>