Claude Mythos Preview Tops GPT-5.4 on Key Benchmarks
Anthropic’s unreleased Mythos Preview beats GPT-5.4 and Gemini 3.1 Pro on coding, math, and agent tests, led by 97.6% on USAMO.

Anthropic has a model that it has not shipped to the public yet, and the numbers are hard to ignore. In its system card, Claude Mythos Preview posts 97.6% on USAMO, 93.9% on SWE-bench Verified, and 79.6% on OSWorld. Those scores put it ahead of OpenAI’s GPT-5.4 and Google DeepMind’s Gemini 3.1 Pro on most of the benchmarks Anthropic published.
The headline here is not that one model won a leaderboard. It is that the gap is wide in places where public models had recently looked very strong, especially math and agentic coding. On USAMO, GPT-5.4 had already looked elite at 95.2%. Mythos Preview pushed that to 97.6%, while Gemini 3.1 Pro landed at 74.4% and Claude Opus 4.6 at 42.3%.
That matters because benchmark ceilings are getting harder to move. When a model jumps from the mid-90s to the high-90s on a hard reasoning test, the change is often less about a flashy demo and more about better reliability under pressure. In practical terms, that can mean fewer dead ends in multi-step reasoning, cleaner code edits, and better performance on tasks where one wrong assumption breaks the whole answer.
What Anthropic says Mythos Preview is better at
Anthropic’s table covers coding, math, scientific reasoning, long-context graph tasks, desktop automation, and multimodal figure interpretation. The pattern is clear: Mythos Preview is strongest where a model has to keep state across many steps and recover from mistakes without losing the thread.

Here are the most eye-catching numbers from the system card:
- SWE-bench Verified: 93.9% for Mythos Preview, 80.6% for Gemini 3.1 Pro, 80.8% for Opus 4.6
- SWE-bench Pro: 77.8% for Mythos Preview, 57.7% for GPT-5.4, 54.2% for Gemini 3.1 Pro
- Terminal-Bench 2.0: 82% for Mythos Preview, 75.1% for GPT-5.4, 68.5% for Gemini 3.1 Pro
- OSWorld: 79.6% for Mythos Preview, 75.0% for GPT-5.4, 72.7% for Opus 4.6
- GraphWalks BFS 256K-1M: 80% for Mythos Preview, 21.4% for GPT-5.4, 38.7% for Opus 4.6
The GraphWalks result is especially striking. Long-context reasoning has been one of the hardest areas for frontier models, because the model has to track structure over very large inputs rather than just answer a local question. A nearly four-to-one lead over GPT-5.4 suggests Anthropic has improved more than raw benchmark memorization.
There is also a useful detail in the SWE-bench Pro result. GPT-5.4’s 57.7% score was already treated as a serious milestone when it launched in early March, so Mythos Preview’s 77.8% is not a small win. It is the kind of jump that changes how teams think about automated code fixing, repo-wide edits, and agent loops that need to survive real software complexity.
Why the USAMO score got everyone’s attention
USAMO is the benchmark that grabs attention because it is hard to fake. It tests deep mathematical reasoning, not pattern matching on a common coding dataset. On that benchmark, Mythos Preview hit 97.6%, GPT-5.4 reached 95.2%, Gemini 3.1 Pro scored 74.4%, and Opus 4.6 came in at 42.3%.
That 2.4-point edge over GPT-5.4 is small on paper, but it matters because GPT-5.4 had already set the bar that people were using as a reference point. When a model clears a result that was recently treated as near the top end of public capability, the signal is that the new model is not just keeping pace. It is moving the ceiling.
“We think the model is the best coding model in the world,” said Dario Amodei, Anthropic’s CEO, in a 2024 interview with WIRED.
That quote was about an earlier generation of Anthropic systems, but it captures the company’s long-running emphasis. Anthropic has consistently pushed hard on coding and reasoning performance, and Mythos Preview looks like the latest proof that this is where its internal research is paying off.
The other thing to notice is the spread across benchmarks. On GPQA Diamond, which measures graduate-level scientific reasoning, Mythos Preview scored 94.5%, just ahead of Gemini 3.1 Pro at 94.3% and GPT-5.4 at 92.8%. That is a narrow win, but it also shows how crowded the top end has become on some tasks. On math-heavy and agent-heavy evaluations, the differences are much larger.
How these numbers compare with public models
Anthropic’s own system card includes an important caveat: Mythos Preview is unreleased, and the numbers come from Anthropic’s own evaluation setup using adaptive thinking at max effort over five trials. Competitor scores come from their own system cards and leaderboards, which do not always use the same harness or settings. So these are directional comparisons, not laboratory-grade head-to-head tests.

Even with that caveat, the spread is large enough to notice. Here is the practical comparison across several tasks:
- On Humanity’s Last Exam without tools, Mythos Preview scored 56.8%, ahead of GPT-5.4 at 39.8% and Gemini 3.1 Pro at 44.4%
- With tools, Mythos Preview reached 64.7%, compared with Gemini 3.1 Pro at 51.4% and GPT-5.4 at 52.1%
- On CharXiv Reasoning, Mythos Preview scored 86.1% without tools and 93.2% with tools, while Opus 4.6 scored 61.5% and 78.9%
- On OSWorld, Mythos Preview led GPT-5.4 by 4.6 points and Opus 4.6 by 6.9 points
Those deltas matter because they map to real product behavior. Coding benchmarks point to better developer agents. OSWorld points to better desktop automation. Long-context graph tasks point to better memory over complex workflows. Put together, the table says Anthropic is aiming at models that can do more than answer prompts: they can keep working through messy, multi-step tasks.
There is also a strategic angle here. Public releases are only part of what frontier labs build. OpenAI has repeatedly tested unreleased models under codenames before launch, and Google has kept powerful systems internal for long stretches too. The public leaderboard is the visible edge of a much larger private race.
What Mythos Preview tells us about the next model cycle
Mythos Preview does not prove that Anthropic has a permanent lead. It does show that the company has an internal model that is ahead of the current public frontier on several hard tasks, especially coding and reasoning under long context. That is enough to change expectations around the next release cycle.
If Anthropic ships something close to these numbers, the market will likely focus on two questions. First, how much of this performance survives a real product setting with lower latency and tighter safety controls. Second, how quickly OpenAI and Google answer with their own next models. A few points on USAMO and a double-digit jump on SWE-bench Pro can reshape developer attention very fast.
For now, the most useful takeaway is simple: if you are choosing a model for code assistance, agent workflows, or math-heavy tasks, the frontier is moving faster in private than it is in public. The next few months should show whether Mythos Preview is a preview of a durable lead or just the first visible sign of a new release wave.
If you follow this space closely, watch for one thing in the next Anthropic release: whether the company keeps this level of long-context and coding performance while making the model fast enough for everyday use. That tradeoff will matter more than the benchmark chart the moment developers start paying for tokens.