[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-claude-mythos-preview-beats-gpt-54-gemini-benchmarks-en":3,"article-related-claude-mythos-preview-beats-gpt-54-gemini-benchmarks-en":31,"series-model-release-993f67fa-c342-4b67-b7f6-144efc0a0eca":79},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":19,"translated_content":10,"views":20,"is_premium":21,"created_at":22,"updated_at":22,"cover_image":11,"published_at":23,"rewrite_status":24,"rewrite_error":10,"rewritten_from_id":25,"slug":26,"category":27,"related_article_id":28,"status":29,"google_indexed_at":30,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":21},"993f67fa-c342-4b67-b7f6-144efc0a0eca","Claude Mythos Preview Tops GPT-5.4 on Key Benchmarks","\u003Cp>\u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\" target=\"_blank\" rel=\"noopener\">Anthropic\u003C\u002Fa> has a model that it has not shipped to the public yet, and the numbers are hard to ignore. In its system card, \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fnews\" target=\"_blank\" rel=\"noopener\">Claude Mythos Preview\u003C\u002Fa> posts 97.6% on USAMO, 93.9% on SWE-bench Verified, and 79.6% on OSWorld. Those scores put it ahead of \u003Ca href=\"https:\u002F\u002Fopenai.com\" target=\"_blank\" rel=\"noopener\">OpenAI\u003C\u002Fa>’s GPT-5.4 and \u003Ca href=\"https:\u002F\u002Fdeepmind.google\" target=\"_blank\" rel=\"noopener\">Google DeepMind\u003C\u002Fa>’s Gemini 3.1 Pro on most of the benchmarks Anthropic published.\u003C\u002Fp>\u003Cp>The headline here is not that one model won a leaderboard. It is that the gap is wide in places where public models had recently looked very strong, especially math and agentic coding. On USAMO, GPT-5.4 had already looked elite at 95.2%. 
\u003Ca href=\"\u002Fnews\u002Fanthropic-claude-mythos-preview-meaning-en\">Mythos Preview\u003C\u002Fa> pushed that to 97.6%, while Gemini 3.1 Pro landed at 74.4% and Claude Opus 4.6 at 42.3%.\u003C\u002Fp>\u003Cp>That matters because benchmark ceilings are getting harder to move. When a model jumps from the mid-90s to the high-90s on a hard reasoning test, the change is often less about a flashy demo and more about better reliability under pressure. In practical terms, that can mean fewer dead ends in multi-step reasoning, cleaner code edits, and better performance on tasks where one wrong assumption breaks the whole answer.\u003C\u002Fp>\u003Ch2>What Anthropic says Mythos Preview is better at\u003C\u002Fh2>\u003Cp>Anthropic’s table covers coding, math, scientific reasoning, long-context graph tasks, desktop automation, and multimodal figure interpretation. The pattern is clear: Mythos Preview is strongest where a model has to keep state across many steps and recover from mistakes without losing the thread.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776082017256-9j4y.png\" alt=\"Claude Mythos Preview Tops GPT-5.4 on Key Benchmarks\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Here are the most eye-catching numbers from the system card:\u003C\u002Fp>\u003Cul>\u003Cli>SWE-bench Verified: 93.9% for Mythos Preview, 80.6% for Gemini 3.1 Pro, 80.8% for Opus 4.6\u003C\u002Fli>\u003Cli>SWE-bench Pro: 77.8% for Mythos Preview, 57.7% for GPT-5.4, 54.2% for Gemini 3.1 Pro\u003C\u002Fli>\u003Cli>Terminal-Bench 2.0: 82% for Mythos Preview, 75.1% for GPT-5.4, 68.5% for Gemini 3.1 Pro\u003C\u002Fli>\u003Cli>OSWorld: 79.6% for Mythos Preview, 75.0% for GPT-5.4, 72.7% for Opus 4.6\u003C\u002Fli>\u003Cli>GraphWalks BFS 256K-1M: 80% for Mythos Preview, 21.4% for GPT-5.4, 38.7% for Opus 
4.6\u003C\u002Fli>\u003C\u002Ful>\u003Cp>The GraphWalks result is especially striking. Long-context reasoning has been one of the hardest areas for frontier models, because the model has to track structure over very large inputs rather than just answer a local question. A nearly four-to-one lead over GPT-5.4 suggests genuine gains in long-context reasoning rather than benchmark memorization.\u003C\u002Fp>\u003Cp>There is also a useful detail in the SWE-bench Pro result. GPT-5.4’s 57.7% score was already treated as a serious milestone when it launched in early March, so Mythos Preview’s 77.8% is not a small win. It is the kind of jump that changes how teams think about automated code fixing, repo-wide edits, and agent loops that need to survive real software complexity.\u003C\u002Fp>\u003Ch2>Why the USAMO score got everyone’s attention\u003C\u002Fh2>\u003Cp>USAMO is the benchmark that grabs attention because it is hard to fake. It tests deep mathematical reasoning, not pattern matching against familiar training data. On that benchmark, Mythos Preview hit 97.6%, GPT-5.4 reached 95.2%, Gemini 3.1 Pro scored 74.4%, and Opus 4.6 came in at 42.3%.\u003C\u002Fp>\u003Cp>That 2.4-point edge over GPT-5.4 is small on paper, but it matters because GPT-5.4 had already set the bar that people were using as a reference point. When a model clears a result that was recently treated as near the top end of public capability, the signal is that the new model is not just keeping pace. It is moving the ceiling.\u003C\u002Fp>\u003Cblockquote>“We think the model is the best coding model in the world,” said Dario Amodei, Anthropic’s CEO, in a 2024 interview with \u003Ca href=\"https:\u002F\u002Fwww.wired.com\u002Fstory\u002Fanthropic-ceo-dario-amodei-ai-safety\u002F\" target=\"_blank\" rel=\"noopener\">WIRED\u003C\u002Fa>.\u003C\u002Fblockquote>\u003Cp>That quote was about an earlier generation of Anthropic systems, but it captures the company’s long-running emphasis. 
Anthropic has consistently pushed hard on coding and reasoning performance, and Mythos Preview looks like the latest proof that this is where its internal research is paying off.\u003C\u002Fp>\u003Cp>The other thing to notice is the spread across benchmarks. On GPQA Diamond, which measures graduate-level scientific reasoning, Mythos Preview scored 94.5%, just ahead of Gemini 3.1 Pro at 94.3% and GPT-5.4 at 92.8%. That is a narrow win, but it also shows how crowded the top end has become on some tasks. On math-heavy and agent-heavy evaluations, the differences are much larger.\u003C\u002Fp>\u003Ch2>How these numbers compare with public models\u003C\u002Fh2>\u003Cp>Anthropic’s own system card notes an important caveat: Mythos Preview is unreleased, and the numbers come from Anthropic’s own evaluation setup using adaptive thinking at max effort over five trials. Competitor scores come from their own system cards and leaderboards, which do not always use the same harness or settings. So these are directional comparisons, not laboratory-grade head-to-head tests.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776082013313-srrv.png\" alt=\"Claude Mythos Preview Tops GPT-5.4 on Key Benchmarks\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Even with that caveat, the spread is large enough to notice. 
Here is the practical comparison across several tasks:\u003C\u002Fp>\u003Cul>\u003Cli>On Humanity’s Last Exam without tools, Mythos Preview scored 56.8%, ahead of GPT-5.4 at 39.8% and Gemini 3.1 Pro at 44.4%\u003C\u002Fli>\u003Cli>With tools, Mythos Preview reached 64.7%, compared with Gemini 3.1 Pro at 51.4% and GPT-5.4 at 52.1%\u003C\u002Fli>\u003Cli>On CharXiv Reasoning, Mythos Preview scored 86.1% without tools and 93.2% with tools, while Opus 4.6 scored 61.5% and 78.9%\u003C\u002Fli>\u003Cli>On OSWorld, Mythos Preview led GPT-5.4 by 4.6 points and Opus 4.6 by 6.9 points\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Those deltas matter because they map to real product behavior. Coding benchmarks point to better developer agents. OSWorld points to better desktop automation. Long-context graph tasks point to better memory over complex workflows. Put together, the table says Anthropic is aiming at models that can do more than answer prompts. They can keep working through messy, multi-step tasks.\u003C\u002Fp>\u003Cp>There is also a strategic angle here. Public releases are only part of what frontier labs build. \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Findex\u002F\" target=\"_blank\" rel=\"noopener\">OpenAI\u003C\u002Fa> has repeatedly tested unreleased models under codenames before launch, and Google has kept powerful systems internal for long stretches too. The public leaderboard is the visible edge of a much larger private race.\u003C\u002Fp>\u003Ch2>What Mythos Preview tells us about the next model cycle\u003C\u002Fh2>\u003Cp>Mythos Preview does not prove that Anthropic has a permanent lead. It does show that the company has an internal model that is ahead of the current public frontier on several hard tasks, especially coding and reasoning under long context. That is enough to change expectations around the next release cycle.\u003C\u002Fp>\u003Cp>If Anthropic ships something close to these numbers, the market will likely focus on two questions. 
First, how much of this performance survives a real product setting with lower latency and tighter safety controls? Second, how quickly will OpenAI and Google answer with their own next models? A few points on USAMO and a double-digit jump on SWE-bench Pro can reshape developer attention very fast.\u003C\u002Fp>\u003Cp>For now, the most useful takeaway is simple: if you are choosing a model for code assistance, agent workflows, or math-heavy tasks, the frontier is moving faster in private than it is in public. The next few months should show whether Mythos Preview is a preview of a durable lead or just the first visible sign of a new release wave.\u003C\u002Fp>\u003Cp>If you follow this space closely, watch for one thing in the next Anthropic release: whether the company keeps this level of long-context and coding performance while making the model fast enough for everyday use. That tradeoff will matter more than the benchmark chart the moment developers start paying for tokens.\u003C\u002Fp>","Anthropic’s unreleased Mythos Preview beats GPT-5.4 and Gemini 3.1 Pro on coding, math, and agent tests, led by 97.6% on USAMO.","officechai.com","https:\u002F\u002Fofficechai.com\u002Fai\u002Fclaude-mythos-benchmarks-vs-gemini-3-1-pro-gpt-5-4\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776082017256-9j4y.png",[13,14,15,16,17,18],"Claude Mythos Preview","Anthropic","GPT-5.4","Gemini 3.1 
Pro","SWE-bench","USAMO","en",0,false,"2026-04-13T12:06:36.377043+00:00","2026-04-13T12:06:36.34+00:00","done","31d183a6-6d5f-464b-b381-fe050a71534d","claude-mythos-preview-beats-gpt-54-gemini-benchmarks-en","model-release","682aaaec-fa6c-4990-8ed2-816079209d3c","published","2026-04-14T09:00:11.232+00:00",{"tags":32,"relatedLang":10,"relatedPosts":42},[33,35,37,39],{"name":18,"slug":34},"usamo",{"name":13,"slug":36},"claude-mythos-preview",{"name":14,"slug":38},"anthropic",{"name":40,"slug":41},"SWE-Bench","swe-bench",[43,49,55,61,67,73],{"id":44,"slug":45,"title":46,"cover_image":47,"image_url":47,"created_at":48,"category":27},"ebd0ef7f-f14d-4e25-a54e-073b49f9d4b9","why-googles-hidden-gemini-live-models-matter-en","Why Google’s Hidden Gemini Live Models Matter More Than the Demo","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778869237748-4rqx.png","2026-05-15T18:20:23.999239+00:00",{"id":50,"slug":51,"title":52,"cover_image":53,"image_url":53,"created_at":54,"category":27},"6c57f6bf-1023-4a22-a6c0-013bd88ac3d1","minimax-m1-open-hybrid-attention-reasoning-model-en","MiniMax-M1 brings 1M-token open reasoning model","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778797872005-z8uk.png","2026-05-14T22:30:39.599473+00:00",{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":27},"68a2ba2e-f07a-4f28-a69c-24bf66652d2e","gemini-omni-video-review-text-rendering-en","Gemini Omni Video Review: Text Rendering Beats Rivals","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778779286834-fy35.png","2026-05-14T17:20:44.524502+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":27},"1d5fc6b1-a87f-48ae-89ee-e5f0da86eb2d","why-xiaomi-mimo-v25-pro-changes-coding-agents-en","Why Xiaomi’s 
MiMo-V2.5-Pro Changes Coding Agents More Than Chatbots","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778689848027-ocpw.png","2026-05-13T16:30:29.661993+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":27},"cb3eac19-4b8d-4ee0-8f7e-d3c2f0b50af5","openai-realtime-audio-models-live-voice-en","OpenAI’s Realtime Audio Models Target Live Voice","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778451653257-dsnq.png","2026-05-10T22:20:33.31082+00:00",{"id":74,"slug":75,"title":76,"cover_image":77,"image_url":77,"created_at":78,"category":27},"84c630af-a060-4b6b-9af2-1b16de0c8f06","anthropic-10-finance-ai-agents-en","Anthropic发布10款金融AI Agent","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778389841959-ktkf.png","2026-05-10T05:10:23.345141+00:00",[80,85,90,95,100,105,110,115,120,125],{"id":81,"slug":82,"title":83,"created_at":84},"d4cffde7-9b50-4cc7-bb68-8bc9e3b15477","nvidia-rubin-ai-supercomputer-en","NVIDIA Unveils Rubin: A Leap in AI Supercomputing","2026-03-25T16:24:35.155565+00:00",{"id":86,"slug":87,"title":88,"created_at":89},"eab919b9-fbac-4048-89fc-afad6749ccef","google-gemini-ai-innovations-2026-en","Google's AI Leap with Gemini Innovations in 2026","2026-03-25T16:27:18.841838+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"5f5cfc67-3384-4816-a8f6-19e44d90113d","gap-google-gemini-ai-checkout-en","Gap Teams Up with Google Gemini for AI-Driven Checkout","2026-03-25T16:27:46.483272+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"f6d04567-47f6-49ec-804c-52e61ab91225","ai-model-release-wave-march-2026-en","Navigating the AI Model Release Wave of March 
2026","2026-03-25T16:28:45.409716+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"895c150c-569e-4fdf-939d-dade785c990e","small-language-models-transform-ai-en","Small Language Models: Llama 3.2 and Phi-3 Transform AI","2026-03-25T16:30:26.688313+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"38eb1d26-d961-4fd3-ae12-9c4089680f5f","midjourney-v8-alpha-features-pricing-en","Midjourney V8 Alpha: A Deep Dive into Its Features and Pricing","2026-03-26T01:25:36.387587+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"bf36bb9e-3444-4fb8-ab19-0df6bc9d8271","rag-2026-indispensable-ai-bridge-en","RAG in 2026: The Indispensable AI Bridge","2026-03-26T01:28:34.472046+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"60881d6d-2310-44ef-b1fb-7f98e9dd2f0e","xiaomi-mimo-trio-agents-robots-voice-en","Xiaomi’s MiMo trio targets agents, robots, and voice","2026-03-28T03:05:08.899895+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"f063d8d1-41d1-4de4-8ebc-6c40511b9369","xiaomi-mimo-v2-pro-1t-moe-agents-en","Xiaomi MiMo-V2-Pro: 1T MoE Model for Agents","2026-03-28T03:06:19.238032+00:00",{"id":126,"slug":127,"title":128,"created_at":129},"a1379e9a-6785-4ff5-9b0a-8cff55f8264f","cursor-composer-2-started-from-kimi-en","Cursor’s Composer 2 started from Kimi","2026-03-28T03:11:59.132398+00:00"]