[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-llm-benchmarks":3},{"tag":4,"articles":11},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"ee654d61-465d-4eec-8060-5b4afb694d7b","LLM benchmarks","llm-benchmarks",3,"LLM benchmarks are used to compare model performance across knowledge, math reasoning, hallucination rate, long context, and conversational quality. Leaderboards such as BenchLM and AIME often reflect the real differences between model upgrades and also inform model selection and deployment decisions.","LLM benchmarks compare models across knowledge, math reasoning, hallucination rate, long-context handling, and chat quality. Results from tests like BenchLM or AIME help teams judge real capability, not just model size or release hype.",[12],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"a7bca854-a4d9-4616-b651-e5d732a63255","5-llm-benchmarks-for-business-buyers-2026-zh","5 LLM Benchmarks","Five benchmarks that help you judge model strength, spot distorted scores, and pick the tests best suited to business procurement.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779161051251-hgbf.png","zh","2026-05-19T03:23:38.737225+00:00"]