[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-模型評測":3},{"tag":4,"articles":10},{"id":5,"name":6,"slug":6,"article_count":7,"description_zh":8,"description_en":9},"5d1906c5-3a61-4806-9d99-0dffc1aa881f","模型評測",3,"模型評測關注的是 AI 模型在知識、推理、長上下文與真實任務上的表現，也包括 benchmark 是否可信。從分數爭議、提示詞對成績的影響，到不同模型在同一測試上的差異，這類內容幫助開發者判斷模型能否真正上線。","Model evaluation covers how AI systems perform on knowledge, reasoning, long-context tasks, and applied workloads, as well as whether benchmark results are trustworthy. It includes score disputes, prompt sensitivity, and cross-model comparisons that help developers judge real deployment readiness.",[11,20,28,36],{"id":12,"slug":13,"title":14,"summary":15,"category":16,"image_url":17,"cover_image":17,"language":18,"created_at":19},"9852e8e5-0ed0-47de-a7cc-f29508bf7e2a","why-llm-leaderboards-are-wrong-about-model-quality-zh","為什麼 LLM 排行榜常常選錯模型品質","LLM 排行榜有參考價值，但不適合拿來決定生產環境要用哪個模型。","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778743869534-q8ae.png","zh","2026-05-14T07:30:23.663726+00:00",{"id":21,"slug":22,"title":23,"summary":24,"category":25,"image_url":26,"cover_image":26,"language":18,"created_at":27},"b875d3ed-f892-43a8-a51e-920729e85b1e","gpt-5-4-benchmarks-2026-scores-rankings-zh","GPT-5.4 知識測驗拿 97.6 分","GPT-5.4 在 BenchLM 知識與理解拿到 97.6 分，總榜暫列第 2，還有 1.05M token 長上下文。這篇拆解它適合哪些工作、和其他模型怎麼比。","model-release","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776082194973-cyii.png","2026-04-13T12:09:40.301446+00:00",{"id":29,"slug":30,"title":31,"summary":32,"category":33,"image_url":34,"cover_image":34,"language":18,"created_at":35},"87335969-ee48-4021-bd27-6731750537ff","duplicate-prompts-can-lift-accuracy-fast-zh","重複提示詞，準確率真的會上升","Google Research 研究發現，提示詞複製一次可讓 70 組模型與基準測試中的 47 組提升準確率，NameIndex 甚至從 21.33% 衝到 97.33%。","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775122500397-vvmh.png","2026-04-02T08:39:34.363421+00:00",{"id":37,"slug":38,"title":39,"summary":40,"category":16,"image_url":41,"cover_image":41,"language":18,"created_at":42},"e660d801-2421-4529-8fa9-86b82b066990","metas-llama-4-benchmark-scandal-gets-worse-zh","Meta Llama 4 分數風波又擴大","Meta 的 Llama 4 原本要延續開放模型聲勢，結果卻陷入評測分數爭議。最新報導指出，Meta 在發布前可能用不同模型跑不同 benchmark，讓分數看起來更好，信任問題也跟著擴大。","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1774516531283-08x2.png","2026-03-26T07:34:21.156421+00:00"]