[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-llm-benchmarks":3},{"tag":4,"articles":11},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"ee654d61-465d-4eec-8060-5b4afb694d7b","LLM benchmarks","llm-benchmarks",3,"LLM 基準測試用來比較模型在知識、數學推理、幻覺率、長上下文與對話品質上的表現，像 BenchLM、AIME 這類榜單常反映模型升級的實際差異，也影響選型與部署判斷。","LLM benchmarks compare models across knowledge, math reasoning, hallucination rate, long-context handling, and chat quality. Results from tests like BenchLM or AIME help teams judge real capability, not just model size or release hype.",[12,21,29,36,44],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"11b9773e-13af-447d-b9a1-7d3232201e4f","why-llm-leaderboards-are-wrong-about-model-quality-en","Why LLM Leaderboards Are Wrong About Model Quality","LLM leaderboards are useful, but they are the wrong way to choose a model for production.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778743847206-191w.png","en","2026-05-14T07:30:26.134864+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":26,"image_url":27,"cover_image":27,"language":19,"created_at":28},"0c006cb0-0acc-43c4-baba-ab78092f0d9b","kimi-k2-6-benchlm-2026-scores-en","Kimi K2.6 Scores: BenchLM’s 2026 Breakdown","Kimi K2.6 ranks #12 overall on BenchLM, with strong coding and agentic scores, plus a 256K context window and open weights.","model-release","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777900276785-cezo.png","2026-05-04T13:10:39.364394+00:00",{"id":30,"slug":31,"title":32,"summary":33,"category":26,"image_url":34,"cover_image":34,"language":19,"created_at":35},"cb45188a-2e6e-4ac7-95f0-39cbd2f7d7a2","gpt-5-4-benchmarks-2026-scores-rankings-en","GPT-5.4 Scores 97.6 in Knowledge Benchmarks","GPT-5.4 tops knowledge benchmarks with 97.6, ranks #2 overall on BenchLM, and posts a 1.05M-token context window.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776082204490-nq2r.png","2026-04-13T12:09:40.792366+00:00",{"id":37,"slug":38,"title":39,"summary":40,"category":41,"image_url":42,"cover_image":42,"language":19,"created_at":43},"1433056d-0745-485f-9501-b6ce042e5516","aime-2026-leaderboard-qwen-leads-math-tests-en","AIME 2026 leaderboard: Qwen leads math tests","Qwen3.6 Plus tops the AIME 2026 math benchmark with 0.953, while 8 models show a wide gap in olympiad-style reasoning.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775179307904-87vj.png","2026-04-03T01:21:30.991592+00:00",{"id":45,"slug":46,"title":47,"summary":48,"category":26,"image_url":49,"cover_image":49,"language":19,"created_at":50},"a1ce1fa4-f4d5-4e96-93dc-2c39628ec0a3","grok-41-xai-quieter-upgrade-matters-en","Grok 4.1: xAI’s quieter upgrade that matters","xAI’s Grok 4.1 cuts hallucinations, boosts chat quality, and adds Fast and Thinking modes with 256k context and 2M-token API support.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775175352422-pgev.png","2026-04-03T00:15:30.256357+00:00"]