[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-aime-2026-leaderboard-qwen-leads-math-tests-en":3,"tags-aime-2026-leaderboard-qwen-leads-math-tests-en":30,"related-lang-aime-2026-leaderboard-qwen-leads-math-tests-en":41,"related-posts-aime-2026-leaderboard-qwen-leads-math-tests-en":45,"series-research-1433056d-0745-485f-9501-b6ce042e5516":82},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"1433056d-0745-485f-9501-b6ce042e5516","AIME 2026 leaderboard: Qwen leads math tests","\u003Cp>The \u003Ca href=\"https:\u002F\u002Fllm-stats.com\u002Fbenchmarks\u002Faime-2026\" target=\"_blank\" rel=\"noopener\">AIME 2026\u003C\u002Fa> leaderboard is tiny, but the signal is strong: 8 models, a top score of 0.953, and a bottom score of 0.375. That spread says a lot about how uneven current models still are when the task shifts from chatty answers to olympiad-style math.\u003C\u002Fp>\u003Cp>This benchmark uses all 30 problems from the 2026 American Invitational Mathematics Examination, split across AIME I and AIME II. Each answer is an integer from 000 to 999, which makes the evaluation clean and unforgiving.\u003C\u002Fp>\u003Ch2>What AIME 2026 is testing\u003C\u002Fh2>\u003Cp>AIME is not a trivia quiz. It asks models to carry several steps of symbolic reasoning, keep track of constraints, and avoid small arithmetic slips that ruin the final answer. 
That makes it a useful stress test for systems that claim they can reason through hard problems instead of just pattern-match on familiar wording.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775179307904-87vj.png\" alt=\"AIME 2026 leaderboard: Qwen leads math tests\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The benchmark page on \u003Ca href=\"https:\u002F\u002Fllm-stats.com\" target=\"_blank\" rel=\"noopener\">LLM Stats\u003C\u002Fa> labels AIME 2026 as a math and reasoning benchmark for text models, with English as the language and a maximum score of 1. The score format is simple, but the task is not.\u003C\u002Fp>\u003Cul>\u003Cli>30 total problems from AIME I and AIME II\u003C\u002Fli>\u003Cli>Integer answers only, from 000 to 999\u003C\u002Fli>\u003Cli>Text-only evaluation\u003C\u002Fli>\u003Cli>8 evaluated models\u003C\u002Fli>\u003Cli>0 verified results, 8 self-reported results\u003C\u002Fli>\u003C\u002Ful>\u003Cp>That last point matters. These numbers are useful, but they are still self-reported. Until more results are verified, the leaderboard is best read as a snapshot of how vendors present their models under a hard math test, not as a final verdict.\u003C\u002Fp>\u003Ch2>Who is winning right now\u003C\u002Fh2>\u003Cp>\u003Ca href=\"https:\u002F\u002Fqwenlm.github.io\u002F\" target=\"_blank\" rel=\"noopener\">Qwen\u003C\u002Fa> takes the lead here with \u003Ca href=\"https:\u002F\u002Fwww.alibabacloud.com\u002Fen\" target=\"_blank\" rel=\"noopener\">Alibaba Cloud\u003C\u002Fa>'s \u003Ca href=\"https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3\u002F\" target=\"_blank\" rel=\"noopener\">Qwen3.6 Plus\u003C\u002Fa> at 0.953. 
Close behind is \u003Ca href=\"https:\u002F\u002Fwww.bytedance.com\u002F\" target=\"_blank\" rel=\"noopener\">ByteDance\u003C\u002Fa>'s Seed 2.0 Pro at 0.942. Those two models are separated by only 0.011, which is small enough to matter when you are comparing top-tier reasoning systems.\u003C\u002Fp>\u003Cp>The middle of the pack gets more crowded. Qwen3.5-397B-A17B lands at 0.913, while \u003Ca href=\"https:\u002F\u002Fblog.google\u002Ftechnology\u002Fai\u002F\" target=\"_blank\" rel=\"noopener\">Google\u003C\u002Fa>'s \u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemma\" target=\"_blank\" rel=\"noopener\">Gemma 4\u003C\u002Fa> family shows a wider spread, from 0.892 down to 0.375 depending on size.\u003C\u002Fp>\u003Cblockquote>“The problem with math is not that it is hard, but that it is easy to be wrong in a way that looks right.” — \u003Ca href=\"https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTerence_Tao\" target=\"_blank\" rel=\"noopener\">Terence Tao\u003C\u002Fa>\u003C\u002Fblockquote>\u003Cp>That quote fits this benchmark nicely. AIME does not reward confident prose. It rewards exactness, and it exposes models that can explain a solution path without actually landing on the right number.\u003C\u002Fp>\u003Ch2>The numbers that matter\u003C\u002Fh2>\u003Cp>The leaderboard is short enough that the differences are easy to read. The average score across all 8 models is 0.783, which is solid but not dominant. 
The standard deviation is 0.238, which tells you the group is spread out rather than clustered tightly around one performance level.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775179305267-9xvg.png\" alt=\"AIME 2026 leaderboard: Qwen leads math tests\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Here is the leaderboard in plain terms:\u003C\u002Fp>\u003Cul>\u003Cli>Qwen3.6 Plus: 0.953\u003C\u002Fli>\u003Cli>Seed 2.0 Pro: 0.942\u003C\u002Fli>\u003Cli>Qwen3.5-397B-A17B: 0.913\u003C\u002Fli>\u003Cli>Gemma 4 31B: 0.892\u003C\u002Fli>\u003Cli>Gemma 4 26B-A4B: 0.883\u003C\u002Fli>\u003Cli>Seed 2.0 Lite: 0.883\u003C\u002Fli>\u003Cli>Gemma 4 E4B: 0.425\u003C\u002Fli>\u003Cli>Gemma 4 E2B: 0.375\u003C\u002Fli>\u003C\u002Ful>\u003Cp>The big story is the drop-off in the smaller Gemma variants. The 31B model is near the top, but the E4B and E2B versions fall sharply. That suggests scale still matters a lot for this kind of reasoning, even when the model family is the same.\u003C\u002Fp>\u003Cp>There is also a practical takeaway for teams choosing models for math-heavy workflows. If your use case depends on exact symbolic reasoning, you cannot assume a smaller model will degrade gracefully. On AIME 2026, it does not.\u003C\u002Fp>\u003Ch2>How this compares with earlier benchmark habits\u003C\u002Fh2>\u003Cp>AIME-style benchmarks are different from broad knowledge tests like \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Findex\u002Fmmlu\u002F\" target=\"_blank\" rel=\"noopener\">MMLU\u003C\u002Fa> or coding tests like \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenai\u002Fhuman-eval\" target=\"_blank\" rel=\"noopener\">HumanEval\u003C\u002Fa>. 
They punish shallow reasoning much more aggressively, and they make it harder for a model to hide behind fluent language.\u003C\u002Fp>\u003Cp>That difference is why math benchmarks have become a favorite way to compare frontier models. A system can look great in a chat demo and still stumble on a contest problem that requires careful algebra, modular arithmetic, or combinatorics. AIME exposes that gap fast.\u003C\u002Fp>\u003Cp>For readers tracking benchmark trends, it is also worth comparing this page with OraCore's coverage of broader model performance, such as \u003Ca href=\"\u002Fnews\u002Fopen-llm-leaderboard-trends\" target=\"_blank\" rel=\"noopener\">Open LLM leaderboard trends\u003C\u002Fa>. Math scores and general-purpose scores often move at different speeds, and that split is becoming more obvious with each new release.\u003C\u002Fp>\u003Cp>Another useful detail: all 8 results on this page are unverified. That is not unusual for a fresh benchmark, but it does mean the numbers should be treated as vendor claims until an independent verification layer catches up.\u003C\u002Fp>\u003Ch2>What to watch next\u003C\u002Fh2>\u003Cp>AIME 2026 is already telling us something simple: the \u003Ca href=\"\u002Fnews\u002Fgemini-3-1-pro-googles-top-model-in-numbers-en\">top models\u003C\u002Fa> are getting very good at structured math, but the gap between the best and the rest is still wide. If the next wave of releases keeps pushing scores above 0.95 while smaller variants remain stuck far lower, model selection will matter more than many teams expect.\u003C\u002Fp>\u003Cp>My bet is that this benchmark will become a standard checkpoint for any company shipping reasoning-focused models in 2026. If you build products that depend on exact answers, not just polished explanations, this is the kind of leaderboard you should watch before choosing a model.\u003C\u002Fp>\u003Cp>The real question is whether the next update brings verified results. 
Until then, AIME 2026 is a useful scoreboard, but it is also a reminder to ask a harder question: when a model gets the right answer, can someone else reproduce it?\u003C\u002Fp>","Qwen3.6 Plus tops the AIME 2026 math benchmark with 0.953, while 8 models show a wide gap in olympiad-style reasoning.","llm-stats.com","https:\u002F\u002Fllm-stats.com\u002Fbenchmarks\u002Faime-2026",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775179307904-87vj.png",[13,14,15,16,17],"AIME 2026","LLM benchmarks","math reasoning","Qwen3.6 Plus","Gemma 4","en",0,false,"2026-04-03T01:21:30.991592+00:00","2026-04-03T01:21:30.963+00:00","done","91d6d55b-732a-4198-8135-1a3b12a8cee1","aime-2026-leaderboard-qwen-leads-math-tests-en","research","5f593215-e1e5-4ea1-92f8-0a08d0ab97a8","published","2026-04-07T07:41:13.255+00:00",[31,33,35,37,39],{"name":17,"slug":32},"gemma-4",{"name":15,"slug":34},"math-reasoning",{"name":13,"slug":36},"aime-2026",{"name":16,"slug":38},"qwen36-plus",{"name":14,"slug":40},"llm-benchmarks",{"id":27,"slug":42,"title":43,"language":44},"aime-2026-leaderboard-qwen-leads-math-tests-zh","AIME 2026 排行榜：Qwen 先拿下數學測試","zh",[46,52,58,64,70,76],{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":26},"94994abd-e24d-4fd1-b941-942d03d19acf","turboquant-seo-shift-small-sites-en","TurboQuant and the SEO Shift for Small Sites","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840455122-jfce.png","2026-05-15T10:20:28.134545+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":26},"670a7f69-911f-41e8-a18b-7d3491253a19","turboquant-vllm-comparison-fp8-kv-cache-en","TurboQuant vs FP8: vLLM’s first broad 
test","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839858405-b5ao.png","2026-05-15T10:10:37.219158+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":26},"5aef1c57-961f-49f7-8277-f83f7336799a","llmbda-calculus-agent-safety-rules-en","LLMbda calculus gives agents safety rules","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825459914-obkf.png","2026-05-15T06:10:36.242145+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":26},"712a0357-f7cd-48f2-adde-c2691da0815f","low-complexity-beamspace-denoiser-mmwave-mimo-en","A simpler beamspace denoiser for mmWave MIMO","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814646705-e7mx.png","2026-05-15T03:10:31.764301+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":26},"f595f949-6ea1-4b0e-a632-f1832ef26e36","ai-benchmark-wins-cyber-scare-defenders-en","Why AI benchmark wins in cyber should scare defenders","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807444539-gz7f.png","2026-05-15T01:10:30.04579+00:00",{"id":77,"slug":78,"title":79,"cover_image":80,"image_url":80,"created_at":81,"category":26},"3ad202d1-9e5f-49c5-8383-02fcf1a23cf2","why-linux-security-needs-patch-wave-mindset-en","Why Linux security needs a patch-wave mindset","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741441493-ikl6.png","2026-05-14T06:50:25.906256+00:00",[83,88,93,98,103,108,113,118,123,128],{"id":84,"slug":85,"title":86,"created_at":87},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: 
Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your 
Data","2026-03-31T06:00:36.65963+00:00"]