[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-2026-domain-specific-llm-benchmarks-map-zh":3,"article-related-2026-domain-specific-llm-benchmarks-map-zh":31,"series-research-d68bf7ed-a36e-4639-bcf0-aa15291a10ce":83},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":23,"views":27,"created_at":28,"published_at":29,"topic_cluster_id":30},"d68bf7ed-a36e-4639-bcf0-aa15291a10ce","2026-domain-specific-llm-benchmarks-map-zh","2026 垂直 LLM 基準地圖","\u003Cp data-speakable=\"summary\">Kili Technology 整理 \u003Ca href=\"\u002Fnews\u002Fmistral-ai-models-ranked-2026-zh\">2026\u003C\u002Fa> 年垂直 \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> 基準，涵蓋醫療、法律、金融、程式與資安。\u003C\u002Fp>\u003Cp>2026 年，\u003Ca href=\"https:\u002F\u002Fkili-technology.com\u002Fblog\u002Fdomain-specific-llm-benchmarks-guide\" target=\"_blank\" rel=\"noopener\">Kili Technology\u003C\u002Fa> 指出，通用測試像 MMLU 和 \u003Ca href=\"\u002Ftag\u002Fswe-bench\">SWE-Bench\u003C\u002Fa> 已很難拉開前沿模型差距，團隊開始改看更貼近真實工作的垂直評測。\u003C\u002Fp>\u003Cp>這份地圖把焦點放在醫療、法務、金融、科學、程式、資安與多語推理。對只看分數的買家來說，訊號很直接：榜單不再只是研究話題，而是採購前的門檻之一。\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>項目\u003C\u002Fth>\u003Cth>數值\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Publication date\u003C\u002Ftd>\u003Ctd>May 21, 2026\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>HealthBench rubric criteria\u003C\u002Ftd>\u003Ctd>48,562\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>HealthBench physicians\u003C\u002Ftd>\u003Ctd>262\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>LegalBench-RAG pairs\u003C\u002Ftd>\u003Ctd>6,858\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>MMLU-ProX language gap\u003C\u002Ftd>\u003Ctd>24.3 points\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Claude Opus 4.5 on SWE-Bench Verified\u003C\u002Ftd>\u003Ctd>80.9%\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Claude Opus 4.5 on SEAL\u003C\u002Ftd>\u003Ctd>45.9%\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>發生了什麼\u003C\u002Fh2>\u003Cp>今年的基準測試明顯分裂成多個垂直賽道。醫療看診斷與臨床安全，法律看檢索與條文理解，金融看報表與風險判讀，程式看修 bug，資安看攻防情境，語言能力則要處理多語落差。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779649569393-il2i.png\" alt=\"2026 垂直 LLM 基準地圖\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這種拆分不是形式變化。當公開排行榜被模型反覆刷高後，它就不太能代表真實能力，反而更像一道及格線，逼團隊去找更細的工作負載。\u003C\u002Fp>\u003Cp>幾個例子很能說明這件事。\u003Ca href=\"https:\u002F\u002Fhealthbench.ai\" target=\"_blank\" rel=\"noopener\">HealthBench\u003C\u002Fa> 用 262 位醫師寫出 48,562 條評分規則，覆蓋 26 個專科與 60 個國家；\u003Ca href=\"https:\u002F\u002Flegalbench.ai\" target=\"_blank\" rel=\"noopener\">LegalBench-RAG\u003C\u002Fa> 則拿 6,858 組問答，測檢索是否真的能在法律語境裡找對內容。\u003C\u002Fp>\u003Cp>在語言與程式上，差距也被放大。MMLU-ProX 揭出高低資源語言在同題上的 24.3 分落差，而 \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fclaude-opus-4-5\" target=\"_blank\" rel=\"noopener\">Claude Opus 4.5\u003C\u002Fa> 在 \u003Ca href=\"\u002Ftag\u002Fswe-bench-verified\">SWE-Bench Verified\u003C\u002Fa> 拿到 80.9%，在 SEAL 只有 45.9%，顯示不同 \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> 會把同一\u003Ca href=\"\u002Fnews\u002Fwhy-washington-is-underreacting-to-ai-security-models-zh\">模型的\u003C\u002Fa>短板照得很清楚。\u003C\u002Fp>\u003Cul>\u003Cli>HealthBench：48,562 條 rubric，262 位醫師參與\u003C\u002Fli>\u003Cli>LegalBench-RAG：6,858 組專家標註問答\u003C\u002Fli>\u003Cli>MMLU-ProX：同題多語比較出 24.3 分差距\u003C\u002Fli>\u003Cli>Claude Opus 4.5：SWE-Bench Verified 80.9%，SEAL 45.9%\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>為什麼重要\u003C\u002Fh2>\u003Cp>對開發者來說，這代表「高分模型」和「可上線模型」之間的距離正在被量化。你可以在通用榜單拿到漂亮成績，但一進到病歷摘要、合約審閱或財報抽取，錯誤常常出在檢索、上下文切分、術語對齊與推理鏈條。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779649569265-gb5r.png\" alt=\"2026 垂直 LLM 基準地圖\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這也改變了產品選型方式。\u003Ca href=\"\u002Fnews\u002Fwhy-mistral-ai-is-safest-european-enterprises-zh\">企業\u003C\u002Fa>不再只問模型誰分數高，而是問它在自己的資料、自己的語言、自己的監管框架裡能不能過關。對醫療、法律和金融這類高風險場景，benchmark 已經開始接近採購清單，而不是研究附錄。\u003C\u002Fp>\u003Cp>從產業角度看，這會推動更多資料標註、專家審核與審計軌跡工具。當 EU AI Act 與 NIST AI RMF 這類要求進入部署流程，能不能說清楚模型在哪裡失手，往往比單一分數更重要。\u003C\u002Fp>\u003Cp>真正的問題不是模型有沒有刷榜，而是它在真實案例裡，會不會讓專業使用者願意簽字。\u003C\u002Fp>","Kili Technology 整理 2026 垂直 LLM 基準，涵蓋醫療、法律、金融、程式與資安。重點是通用榜單已不足以分出模型差距，採購與合規開始看專業評測。","kili-technology.com","https:\u002F\u002Fkili-technology.com\u002Fblog\u002Fdomain-specific-llm-benchmarks-guide",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779649569393-il2i.png","research","zh","7a07f021-272f-480d-87c1-a76b203f9b71",[17,18,19,20,21,22],"LLM benchmark","垂直評測","醫療 AI","法律 AI","SWE-Bench","多語推理",[24,25,26],"通用 benchmark 已難以區分前沿模型，垂直評測正成為新標準。","醫療、法律、金融與資安場景更看重專家標註與真實工作流。","採購與合規開始把 benchmark 當門檻，而不是單純的研究指標。",9,"2026-05-24T19:05:41.374014+00:00","2026-05-24T19:05:41.18+00:00","0c35a120-52fc-41fc-afa3-d404eb934158",{"tags":32,"relatedLang":42,"relatedPosts":46},[33,34,36,38,40],{"name":18,"slug":18},{"name":19,"slug":35},"醫療-ai",{"name":20,"slug":37},"法律-ai",{"name":21,"slug":39},"swe-bench",{"name":17,"slug":41},"llm-benchmark",{"id":15,"slug":43,"title":44,"language":45},"2026-domain-specific-llm-benchmarks-map-en","2026 domain-specific LLM benchmarks map","en",[47,53,59,65,71,77],{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":13},"2a2b904a-d812-40ae-bdac-dc07bc6afd45","persona-pruner-lightweight-role-playing-models-zh","Persona-Pruner：把大模型修成角色專用小腦袋","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781505181281-pq7r.png","2026-06-15T06:32:24.904806+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":13},"d77cb1e8-9169-416f-a673-317bc4e2ee39","clinhallu-medical-mllm-hallucination-benchmark-zh","ClinHallu 追蹤醫療 MLLM 幻覺來源","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781504269169-ifu4.png","2026-06-15T06:17:22.803066+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":13},"8ceebbae-fce7-4672-9aaa-83f087961e43","gaze-heads-steering-vlms-attention-zh","用注意力頭引導 VLM 看圖說話","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781503374052-ojne.png","2026-06-15T06:02:26.201961+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":13},"e6c76870-1fa5-45e5-bb8c-436070b9e5cc","ai-benchmarks-2026-evaluations-limits-zh","AI Benchmarks 2026：高分撞上天花板","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781381872937-6kjx.png","2026-06-13T20:17:25.971321+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":13},"59cf2061-712e-4a92-b3a7-5bdd8644c5a6","art-fine-tunes-multimodal-llms-via-pixels-zh","用像素微調多模態 LLM","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781266684477-t1np.png","2026-06-12T12:17:31.662347+00:00",{"id":78,"slug":79,"title":80,"cover_image":81,"image_url":81,"created_at":82,"category":13},"e08b8946-29a0-486a-b2c1-b23faf16b441","taxonomy-rwa-tokenization-blockchain-infrastructure-zh","RWA 代幣化的 23 維分類法","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781259482592-9fiv.png","2026-06-12T10:17:30.417901+00:00",[84,89,94,99,104,109,114,119,124,129],{"id":85,"slug":86,"title":87,"created_at":88},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 AI","2026-03-26T08:16:02.367355+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"53a0dc54-0371-4e40-8d5e-74e94a73840c","geometry-aware-similarity-metrics-for-neural-representations-zh","超越距離測量：用微分幾何重新理解神經網路","2026-03-31T06:01:01.241968+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"fee7d472-a775-4b1d-bbc2-1e8bca1bbf8b","on-the-fly-repulsion-in-the-contextual-space-for-rich-divers-zh","讓AI繪圖更有創意：用排斥力提升生成多樣性","2026-03-31T06:01:25.439673+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"a9901203-d69b-447b-8854-15d14eab32b4","vision-aided-beam-prediction-cnn-eca-zh","影像輔助波束預測升級 CNN","2026-04-01T10:00:25.8073+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"b55e7dd4-0a24-4b3d-804d-b0309a03f498","triple-band-fss-mimo-antenna-sub-6-ghz-zh","三頻 FSS MIMO 天線瞄準 sub-6 GHz","2026-04-01T13:18:36.857305+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"f68290bd-e7f3-4b30-ba22-dcd4e0130a66","openclaw-1299-repos-eight-weeks-analysis-zh","OpenClaw 1299 個 Repo 的資料解讀","2026-04-02T05:03:45.208411+00:00",{"id":130,"slug":131,"title":132,"created_at":133},"ed9f80eb-eb02-4d35-8ad4-0ddf428751dd","beam-coherence-aware-combining-mmwave-mimo-zh","毫米波 MIMO 的雙階合併法","2026-04-02T05:27:26.897188+00:00"]