[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-swe-bench-verified":3},{"tag":4,"articles":11},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"aeb14571-0546-4407-8ad9-01785c371c34","SWE-Bench Verified","swe-bench-verified",8,"SWE-bench Verified 是用真實 GitHub issue 與測試來評估模型修補程式碼能力的基準，常用來看 agentic coding、除錯與工具使用表現。它之所以重要，在於分數背後還牽涉 token 成本、上下文長度與部署可行性。","SWE-bench Verified is a benchmark for measuring how well models fix real GitHub issues against real tests, making it a useful signal for agentic coding, debugging, and tool use. It also exposes practical tradeoffs in token cost, context length, and deployment.",[12,21],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"11b9773e-13af-447d-b9a1-7d3232201e4f","why-llm-leaderboards-are-wrong-about-model-quality-en","Why LLM Leaderboards Are Wrong About Model Quality","LLM leaderboards are useful, but they are the wrong way to choose a model for production.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778743847206-191w.png","en","2026-05-14T07:30:26.134864+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":26,"image_url":27,"cover_image":27,"language":19,"created_at":28},"8958b20f-16e9-4838-b10e-d75865a3a3e5","claude-mythos-vs-opus-46-capability-jump-en","Claude Mythos vs Opus 4.6: How Big Is the Jump?","Leaked benchmarks suggest Claude Mythos could beat Opus 4.6 by a wide margin in coding, reasoning, and security tasks.","model-release","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775125823282-ov6z.png","2026-04-02T09:09:38.844497+00:00"]