[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-benchmark":3},{"tag":4,"articles":10},{"id":5,"name":6,"slug":6,"article_count":7,"description_zh":8,"description_en":9},"736c4d52-f7e2-4456-a45f-50aae8402b4e","benchmark",6,"Benchmark 不只是比誰分數高，而是用固定任務檢查模型、代理與編譯器在真實條件下的穩定性。從長鏈推理、資料視覺化工作流到程式碼安全與效能，基準測試也在考驗方法是否可信。","Benchmarking is how teams check whether models, agents, and compilers hold up under fixed tasks and real constraints. It covers long-horizon reasoning, data-viz workflows, code safety, and performance, while also exposing how much a score can be distorted by the test itself.",[11,20,27,34],{"id":12,"slug":13,"title":14,"summary":15,"category":16,"image_url":17,"cover_image":17,"language":18,"created_at":19},"442f0ac0-6fd2-460b-83ab-694f0627d98f","longmemeval-v2-agent-memory-web-workflows-en","LongMemEval-V2 tests agent memory in web workflows","A new benchmark checks whether agent memory can retain web-environment experience, not just user history, and improve long-term task recall.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778653256519-96ad.png","en","2026-05-13T06:20:30.047955+00:00",{"id":21,"slug":22,"title":23,"summary":24,"category":16,"image_url":25,"cover_image":25,"language":18,"created_at":26},"f414aa1a-27e8-45d9-b407-d542121915d2","llms-procedural-execution-diagnostic-study-en","When LLMs Stop Following Procedural Steps","A diagnostic benchmark shows LLMs lose procedural fidelity as step counts grow, even when the arithmetic stays simple.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777875670060-pmbt.png","2026-05-04T06:20:27.84519+00:00",{"id":28,"slug":29,"title":30,"summary":31,"category":16,"image_url":32,"cover_image":32,"language":18,"created_at":33},"6bf86d0c-df4b-4e0c-82b7-1c06b2ef80d5","asmr-bench-sabotage-detection-ml-code-en","ASMR-Bench Tests Sabotage Detection in ML Code","ASMR-Bench probes whether auditors can spot subtle sabotage in ML research codebases, and the answer so far is: not reliably.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776665038230-idp9.png","2026-04-20T06:03:33.439449+00:00",{"id":35,"slug":36,"title":37,"summary":38,"category":16,"image_url":39,"cover_image":39,"language":18,"created_at":40},"9f62add5-cae5-47eb-abd5-2e56d0d5698c","longcot-long-horizon-chain-of-thought-benchmark-en","LongCoT Benchmark: 2,500-Probl. Long-Horizon Reasoning","LongCoT is a 2,500-problem benchmark for measuring whether frontier models can sustain long, interdependent reasoning chains.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776319782523-s0wz.png","2026-04-16T06:09:23.265233+00:00"]