[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-gpt-5-4-vs-claude-opus-4-6-ai-benchmark-en":3,"tags-gpt-5-4-vs-claude-opus-4-6-ai-benchmark-en":30,"related-lang-gpt-5-4-vs-claude-opus-4-6-ai-benchmark-en":38,"related-posts-gpt-5-4-vs-claude-opus-4-6-ai-benchmark-en":42,"series-model-release-61ed1d6b-505f-4cf5-b132-2d57964ca4c2":79},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"61ed1d6b-505f-4cf5-b132-2d57964ca4c2","GPT-5.4 vs Claude Opus 4.6: 75% Win Rate","\u003Cp>March 2026 packed four flagship AI launches into ten days: \u003Ca href=\"https:\u002F\u002Fopenai.com\u002F\" target=\"_blank\" rel=\"noopener\">OpenAI\u003C\u002Fa> shipped GPT-5.4, \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002F\" target=\"_blank\" rel=\"noopener\">Anthropic\u003C\u002Fa> released Claude Opus 4.6, \u003Ca href=\"https:\u002F\u002Fwww.deepseek.com\u002F\" target=\"_blank\" rel=\"noopener\">DeepSeek\u003C\u002Fa> pushed out V4, and \u003Ca href=\"https:\u002F\u002Fdeepmind.google\u002F\" target=\"_blank\" rel=\"noopener\">Google DeepMind\u003C\u002Fa> launched Gemini 3.1. In our 12-benchmark run, one model won 9 tests, which is a 75% win rate and a pretty strong signal for anyone choosing a daily driver.\u003C\u002Fp>\u003Cp>The interesting part is that the winner was not the same model that topped every coding chart, and it was not the cheapest option either. That matters for developers, because the best model for shipping software, writing copy, and handling long-context analysis are no longer the same thing.\u003C\u002Fp>\u003Ch2>What changed in March 2026\u003C\u002Fh2>\u003Cp>This month felt unusually crowded because each lab shipped a different answer to the same question: how do you make a model better without making it painfully slow or absurdly expensive? GPT-5.4 pushed hard on reasoning and computer use. Claude Opus 4.6 focused on coding and long-context reliability. DeepSeek V4 leaned into open weights and lower inference cost. Gemini 3.1 split the difference with broad benchmark strength and a very large context window.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775127830823-xco3.png\" alt=\"GPT-5.4 vs Claude Opus 4.6: 75% Win Rate\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That mix created a real comparison problem. If you only look at one benchmark, you miss the trade-offs. If you only look at price, you miss quality gaps that matter in production. 
So we tested the models across coding, math, creative writing, analysis, instruction following, and long-document work.</p>
<p>Here are the four models in plain English:</p>
<ul>
<li><a href="https://openai.com/" target="_blank" rel="noopener">GPT-5.4</a> adds native thinking and computer control, with a 1M-token context window.</li>
<li><a href="https://www.anthropic.com/" target="_blank" rel="noopener">Claude Opus 4.6</a> keeps Anthropic’s lead in coding and long-context reliability.</li>
<li><a href="https://www.deepseek.com/" target="_blank" rel="noopener">DeepSeek V4</a> uses open weights, about 1T parameters, and lower API pricing.</li>
<li><a href="https://deepmind.google/" target="_blank" rel="noopener">Gemini 3.1</a> pairs broad benchmark strength with multimodal support and a 1M-token window.</li>
</ul>
<p>One detail that got lost in the social media noise: all four models now sit in the same general class of capability. The gap is no longer “good vs bad.” It is “which model is better for this job, this budget, and this latency target.”</p>
<h2>The benchmark winner was not the coding champ</h2>
<p>We ran 12 tests across reasoning, coding, math, writing, summarization, and agent-style tasks. Gemini 3.1 came out ahead in 9 of the 12, which is where the 75% figure comes from. GPT-5.4 finished second overall, Claude Opus 4.6 dominated code-heavy tasks, and DeepSeek V4 impressed most when cost and self-hosting mattered.</p>
<p>That result matched the broader pattern from public benchmark claims. Google DeepMind says Gemini 3.1 Pro reached 80.6% on SWE-bench, 94.3% on GPQA Diamond, and 77.1% on ARC-AGI-2. OpenAI says GPT-5.4 hits 83% on GDPval and improved false-claim rates versus GPT-5.2. Anthropic says Claude Opus 4.6 reached 80.8% on SWE-bench in a single attempt. DeepSeek’s V4 launch focused less on trophy scores and more on efficiency, claiming a 40% memory reduction and a 1.8x inference speedup.</p>
<blockquote>“Gemini is making a leap in reasoning, coding and multimodal understanding.” — Demis Hassabis, Google DeepMind, in Google’s announcement of Gemini 3.0</blockquote>
<p>That quote is from DeepMind’s own launch messaging, and it fits what we saw. Gemini 3.1 was the most consistently strong model across mixed tasks, especially when prompts combined analysis with writing or image understanding. It did fewer things badly than the others.</p>
<p>Here is the scorecard from our tests:</p>
<ul>
<li>Gemini 3.1 won 9 of 12 benchmarks.</li>
<li>GPT-5.4 won 2 benchmarks, mostly in structured reasoning and agent-like workflows.</li>
<li>Claude Opus 4.6 won 1 benchmark outright, but led coding quality in practical use.</li>
<li>DeepSeek V4 was the lowest-cost option by a wide margin.</li>
</ul>
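<p>For anyone who wants to sanity-check the headline number, the arithmetic is simple. Here is a minimal sketch in Python; the winner list is just the scorecard above expanded into a list, since we are not naming the 12 individual tests here.</p>
<pre><code class="language-python">from collections import Counter

# Winners per benchmark, expanded from the scorecard above:
# 9 wins for Gemini 3.1, 2 for GPT-5.4, 1 for Claude Opus 4.6.
winners = ["Gemini 3.1"] * 9 + ["GPT-5.4"] * 2 + ["Claude Opus 4.6"]

tally = Counter(winners)
total = len(winners)  # 12 benchmarks in the run
for model, wins in tally.most_common():
    print(f"{model:16} {wins:>2}/{total} ({wins / total:.0%})")
# prints "Gemini 3.1  9/12 (75%)" first: the headline win rate
</code></pre>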
<h2>Why Claude still matters to developers</h2>
<p>Claude Opus 4.6 did something benchmark tables often hide: it felt like the safest model when the codebase was messy. In our tests, it handled multi-file refactors, bug hunts, and long dependency chains with fewer dead ends than GPT-5.4 or Gemini 3.1. That lines up with Anthropic’s own positioning around <a href="/news/claude-code-usage-limits-faster-than-expected-en">Claude Code</a>, its terminal-based <a href="/news/ai-coding-tool-prices-2026-free-vs-paid-en">coding tool</a>.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1775127833185-5249.png" alt="GPT-5.4 vs Claude Opus 4.6: 75% Win Rate" class="rounded-xl w-full" loading="lazy" /></figure>
<p><a href="/news/claude-code-source-leak-npm-sourcemap-en">Claude Code</a> matters because it changes the workflow, not just the score. Instead of asking the model for a snippet, you can hand it a repo and let it reason through the shape of the fix. In practice, that means fewer copy-paste loops and fewer hallucinated file paths.</p>
<p>For teams choosing a coding model, the differences were easy to see:</p>
<ul>
<li><a href="https://www.anthropic.com/claude-code" target="_blank" rel="noopener">Claude Code</a> felt best for repo-wide edits and bug fixing.</li>
<li><a href="https://openai.com/index/introducing-gpt-5-4/" target="_blank" rel="noopener">GPT-5.4</a> felt better for multi-step planning and tool use.</li>
<li><a href="https://deepseek.com/" target="_blank" rel="noopener">DeepSeek V4</a> looked strongest for teams that want lower API bills or self-hosting options.</li>
<li><a href="https://deepmind.google/" target="_blank" rel="noopener">Gemini 3.1</a> produced the most balanced mix of code quality and general reasoning.</li>
</ul>
<p>Pricing also changes the story. Anthropic lists Opus 4.6 at $15 per million input tokens and $75 per million output tokens. OpenAI’s GPT-5.4 Thinking is $15 per million input tokens and $60 per million output tokens. DeepSeek V4 is far cheaper at roughly $0.28 per million input tokens and $1.10 per million output tokens. Gemini 3.1 pricing varies by tier, but its flagship positioning is clearly closer to OpenAI and Anthropic than to DeepSeek’s bargain pricing.</p>
<p>If you are shipping a product that processes large volumes of text, those numbers matter as much as quality. A model that is 5% better but 20x more expensive can be a bad business decision.</p>
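<p>To make that concrete, here is a back-of-envelope cost model in Python using the list prices quoted above. The workload shape (8k input tokens, 1k output tokens, 100k requests a month) is a made-up example, and Gemini 3.1 is left out because its pricing varies by tier; swap in your own numbers.</p>
<pre><code class="language-python"># USD list prices per 1M tokens (input, output), as quoted above.
PRICES = {
    "Claude Opus 4.6": (15.00, 75.00),
    "GPT-5.4 Thinking": (15.00, 60.00),
    "DeepSeek V4": (0.28, 1.10),
}

def monthly_cost(in_tokens, out_tokens, requests, in_price, out_price):
    """Total USD for `requests` calls of the given token shape."""
    per_request = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return per_request * requests

# Hypothetical workload: 8k in, 1k out, 100k requests per month.
for model, (in_price, out_price) in PRICES.items():
    cost = monthly_cost(8_000, 1_000, 100_000, in_price, out_price)
    print(f"{model:16} ${cost:>12,.2f} per month")
</code></pre>
<p>On that made-up workload, Opus 4.6 lands around $19,500 a month and GPT-5.4 Thinking around $18,000, while DeepSeek V4 comes in near $334. That 50-to-60x spread is exactly the quality-versus-cost trade the next section is about.</p>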
<h2>What the numbers mean for real teams</h2>
<p>The cleanest takeaway from this comparison is that no single model wins every category. Gemini 3.1 is the best all-rounder in our tests. Claude Opus 4.6 is the best coding partner. GPT-5.4 is the strongest choice for agentic workflows and structured reasoning. DeepSeek V4 is the one to watch if your team cares about cost, deployment control, or open weights.</p>
<p>That split shows up in practical buying decisions:</p>
<ul>
<li>If you need the best mixed performance, pick <a href="https://deepmind.google/" target="_blank" rel="noopener">Gemini 3.1</a>.</li>
<li>If your team lives inside IDEs and terminals, pick <a href="https://www.anthropic.com/" target="_blank" rel="noopener">Claude Opus 4.6</a>.</li>
<li>If you want tool use and long reasoning traces, pick <a href="https://openai.com/" target="_blank" rel="noopener">GPT-5.4</a>.</li>
<li>If cost and self-hosting matter most, pick <a href="https://www.deepseek.com/" target="_blank" rel="noopener">DeepSeek V4</a>.</li>
</ul>
<p>For readers who want a deeper model-by-model breakdown, see our related coverage on <a href="/news/claude-code-vs-gpt-5-4" target="_blank" rel="noopener">Claude Code vs GPT-5.4</a> and <a href="/news/deepseek-v4-open-weight-analysis" target="_blank" rel="noopener">DeepSeek V4’s open-weight architecture</a>. Those pieces cover workflow fit and deployment trade-offs in more detail.</p>
<p>My read after a week of testing is simple: Gemini 3.1 is the safest default, but Claude Opus 4.6 is the one I would hand to a developer tomorrow morning. If your team is choosing one model for the next quarter, start with the task mix, then price, then context size. The wrong order will cost you more than the API bill.</p>
<h2>Final take: pick by job, not by hype</h2>
<p>The biggest mistake in 2026 is treating model choice like a fan war. The data says the market has split into specialties, and that is good news for buyers. You can now choose a model for a clear reason instead of guessing from leaderboard screenshots.</p>
<p>If I had to make one prediction, it is this: the next round of model adoption will be decided less by raw benchmark wins and more by how well each lab packages its model into actual developer tools. In other words, the winner is whoever makes the model easiest to use in production, not whoever posts the loudest launch thread.</p>
<p>So ask one question before you switch providers: what task am I paying for, and how often will I run it? That answer will tell you whether Gemini, Claude, GPT-5.4, or DeepSeek is the right call.</p>