AIME 2026 leaderboard: Qwen leads math tests
Qwen3.6 Plus tops the AIME 2026 math benchmark with 0.953, while the 8 evaluated models show a wide spread in olympiad-style reasoning.

The AIME 2026 leaderboard is tiny, but the signal is strong: 8 models, a top score of 0.953, and a bottom score of 0.375. That spread says a lot about how uneven current models still are when the task shifts from chatty answers to olympiad-style math.
This benchmark uses all 30 problems from the 2026 American Invitational Mathematics Examination, split across AIME I and AIME II. Each answer is an integer from 000 to 999, which makes the evaluation clean and unforgiving.
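Because every answer is a zero-padded integer from 000 to 999, grading reduces to exact match after a bit of normalization. Here is a minimal sketch of that logic in Python; the function names are illustrative, and this is not the official LLM Stats harness:

```python
# Minimal sketch of AIME-style exact-match grading. Names are illustrative;
# this is not the official LLM Stats harness.

def normalize_answer(raw: str) -> str | None:
    """Normalize a model's answer to canonical three-digit form, e.g. '42' -> '042'."""
    digits = raw.strip()
    if not digits.isdigit() or int(digits) > 999:
        return None  # malformed or out-of-range output counts as wrong
    return f"{int(digits):03d}"

def score(predictions: list[str], answers: list[str]) -> float:
    """Fraction of problems answered exactly right.

    `answers` are assumed to be canonical three-digit strings ('000'..'999').
    """
    hits = sum(normalize_answer(p) == a for p, a in zip(predictions, answers))
    return hits / len(answers)

# One clean answer, one malformed one, against references '042' and '700':
print(score(["42", "7000"], ["042", "700"]))  # 0.5
```

The unforgiving part is visible in the sketch: a model that reasons perfectly but emits anything other than a clean integer scores zero on that problem.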
What AIME 2026 is testing
AIME is not a trivia quiz. It asks models to carry several steps of symbolic reasoning, keep track of constraints, and avoid small arithmetic slips that ruin the final answer. That makes it a useful stress test for systems that claim they can reason through hard problems instead of just pattern-matching on familiar wording.

The benchmark page on LLM Stats labels AIME 2026 as a math and reasoning benchmark for text models, with English as the language and a maximum score of 1. The score format is simple, but the task is not.
- 30 total problems from AIME I and AIME II
- Integer answers only, from 000 to 999
- Text-only evaluation
- 8 evaluated models
- 0 verified results, 8 self-reported results
That last point matters. These numbers are useful, but they are still self-reported. Until more results are verified, the leaderboard is best read as a snapshot of how vendors present their models under a hard math test, not as a final verdict.
Who is winning right now
Qwen takes the lead here with Alibaba Cloud's Qwen3.6 Plus at 0.953. Close behind is ByteDance's Seed 2.0 Pro at 0.942. Those two models are separated by only 0.011, less than the value of a single problem on a 30-question exam (1/30 ≈ 0.033), which makes the top of the board a near-tie rather than a clear win.
The middle of the pack gets more crowded. Qwen3.5-397B-A17B lands at 0.913, while Google's Gemma 4 family shows a wider spread, from 0.892 down to 0.375 depending on size.
“The problem with math is not that it is hard, but that it is easy to be wrong in a way that looks right.” — Terence Tao
That quote fits this benchmark nicely. AIME does not reward confident prose. It rewards exactness, and it exposes models that can explain a solution path without actually landing on the right number.
The numbers that matter
The leaderboard is short enough that the differences are easy to read. The average score across all 8 models is 0.783, which is solid but not dominant. The standard deviation is 0.238, which tells you the group is spread out rather than clustered tightly around one performance level. (The snippet after the list below reproduces both figures.)

Here is the leaderboard in plain terms:
- Qwen3.6 Plus: 0.953
- Seed 2.0 Pro: 0.942
- Qwen3.5-397B-A17B: 0.913
- Gemma 4 31B: 0.892
- Gemma 4 26B-A4B: 0.883
- Seed 2.0 Lite: 0.883
- Gemma 4 E4B: 0.425
- Gemma 4 E2B: 0.375
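For readers who want to check the arithmetic, both summary figures can be reproduced from the self-reported scores in a few lines. This is an illustrative snippet, not an official LLM Stats script:

```python
# Reproducing the summary statistics from the self-reported scores above.
from statistics import mean, stdev

scores = {
    "Qwen3.6 Plus": 0.953,
    "Seed 2.0 Pro": 0.942,
    "Qwen3.5-397B-A17B": 0.913,
    "Gemma 4 31B": 0.892,
    "Gemma 4 26B-A4B": 0.883,
    "Seed 2.0 Lite": 0.883,
    "Gemma 4 E4B": 0.425,
    "Gemma 4 E2B": 0.375,
}

print(f"mean:  {mean(scores.values()):.3f}")   # -> 0.783
# 0.238 matches the sample standard deviation; the population
# figure would be about 0.223.
print(f"stdev: {stdev(scores.values()):.3f}")  # -> 0.238
```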
The big story is the drop-off in the smaller Gemma variants. The 31B model is near the top, but the E4B and E2B versions fall sharply. That suggests scale still matters a lot for this kind of reasoning, even when the model family is the same.
There is also a practical takeaway for teams choosing models for math-heavy workflows. If your use case depends on exact symbolic reasoning, you cannot assume a smaller model will degrade gracefully. On AIME 2026, it does not.
How this compares with earlier benchmark habits
AIME-style benchmarks are different from broad knowledge tests like MMLU or coding tests like HumanEval. They punish shallow reasoning much more aggressively, and they make it harder for a model to hide behind fluent language.
That difference is why math benchmarks have become a favorite way to compare frontier models. A system can look great in a chat demo and still stumble on a contest problem that requires careful algebra, modular arithmetic, or combinatorics. AIME exposes that gap fast.
For readers tracking benchmark trends, it is also worth comparing this page with OraCore's coverage of broader model performance, such as Open LLM leaderboard trends. Math scores and general-purpose scores often move at different speeds, and that split is becoming more obvious with each new release.
Another useful detail: all 8 results on this page are unverified. That is not unusual for a fresh benchmark, but it does mean the numbers should be treated as vendor claims until an independent verification layer catches up.
What to watch next
AIME 2026 is already telling us something simple: the top models are getting very good at structured math, but the gap between the best and the rest is still wide. If the next wave of releases keeps pushing scores above 0.95 while smaller variants remain stuck far lower, model selection will matter more than many teams expect.
My bet is that this benchmark will become a standard checkpoint for any company shipping reasoning-focused models in 2026. If you build products that depend on exact answers, not just polished explanations, this is the kind of leaderboard you should watch before choosing a model.
The real question is whether the next update brings verified results. Until then, AIME 2026 is a useful scoreboard, but it is also a reminder to ask a harder question: when a model gets the right answer, can someone else reproduce it?