Claude vs GPT vs Gemini: Coding Benchmark Leaderboard
A June 2026 coding benchmark comparison of Claude, GPT, and Gemini for model buyers.

This comparison covers Claude, GPT, and Gemini, and it helps you decide which one fits coding work when price, context, and benchmark evidence do not line up cleanly.
At a glance
| Dimension | Claude Fable 5 | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Input / output price per 1M tokens | $10 / $50 | $5 / $25 | $5 / $30 | $2 / $12 up to 200K, then $4 / $18 |
| Context window | 1M tokens | 1M tokens | 1,050,000 tokens | 1M tokens |
| Max output | 128K tokens | 128K tokens | 128K tokens | 64K tokens |
| Published coding score we could verify | Not machine-verifiable | Not machine-verifiable | 83.4% Terminal-Bench 2.1, per competitor attribution | 80.6% SWE-bench Verified; 54.2% SWE-bench Pro Public; 2887 Elo LiveCodeBench Pro |
| Knowledge cutoff | Not stated on overview | Jan 2026 | Dec 1, 2025 | Not stated on model card |
| Verification note | Official score table was image-based | Official score table was image-based | Primary page was not machine-readable to the fetcher | Scores came from Google’s official model card |
Claude: strongest on breadth, weaker on public benchmark visibility
Claude Fable 5 and Claude Opus 4.8 are the least tidy options to compare because Anthropic’s public coding tables were not machine-readable on the verification date. That does not mean they are weak models. It means the leaderboard-style proof is harder to extract from the source, so buyers have to lean more on the spec sheet, the product tier, and their own tests.

The practical distinction is price and positioning. Claude Opus 4.8 is the better value inside Anthropic’s lineup at $5 per million input tokens and $25 per million output tokens, with a 1M-token context window and 128K output. Fable 5 doubles that to $10 / $50, which signals a premium tier for teams that want Anthropic’s top release even when the benchmark table is not easy to quote back.
GPT: a strong middle ground if you want long context and a readable score
GPT-5.5 is the OpenAI entry in this comparison. Its pricing and context specs are easy to verify, and Anthropic’s page attributes a Terminal-Bench 2.1 score of 83.4% to it. The caveat is important: that number is competitor-reported, not a score we read directly from OpenAI, so it should be treated as directional rather than final proof.

Still, GPT-5.5 looks attractive for teams that want a large 1,050,000-token context window, 128K max output, and a familiar $5 / $30 price point. If your coding workflow involves long repo-wide prompts, agent loops, and lots of pasted context, GPT-5.5 gives you a lot of room without moving into the most expensive tier.
Gemini: best verified public scores and the lowest entry price
Gemini 3.1 Pro is the most benchmark-transparent model in this set because Google’s official model card publishes the figures directly: 80.6% on SWE-bench Verified, 54.2% on SWE-bench Pro Public, and 2887 Elo on LiveCodeBench Pro. Those are single-attempt results from the official card, which makes them especially useful if you care about public, source-backed numbers more than vendor-adjacent references.
It also has the sharpest price advantage: $2 / $12 per million input / output tokens for prompts up to 200K, rising to $4 / $18 above that. The trade-offs are a 64K max output and a benchmark profile that is not directly comparable across every test harness. Gemini is the value play when you want a strong published score and lower cost, and you can live with shorter outputs.
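To make the price differences concrete, here is a minimal sketch that estimates the cost of a single request from the per-million-token prices in the table. The function name `request_cost` is hypothetical, and the handling of Gemini's tier is an assumption: the sketch bills the whole request at the higher rate once the prompt exceeds 200K tokens, which may not match how the provider actually meters usage.

```python
# Prices in USD per 1M tokens (input, output), copied from the comparison table.
FLAT_PRICES = {
    "Claude Fable 5": (10.0, 50.0),
    "Claude Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
}

# Gemini 3.1 Pro is priced in two tiers, keyed on prompt size.
GEMINI = "Gemini 3.1 Pro"
GEMINI_TIER_LIMIT = 200_000        # prompt tokens
GEMINI_BASE_PRICE = (2.0, 12.0)    # prompts up to 200K tokens
GEMINI_LONG_PRICE = (4.0, 18.0)    # prompts above 200K tokens


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if model == GEMINI:
        # ASSUMPTION: once the prompt exceeds 200K tokens, the whole request
        # is billed at the higher rate, not only the tokens past the limit.
        long_prompt = input_tokens > GEMINI_TIER_LIMIT
        in_price, out_price = GEMINI_LONG_PRICE if long_prompt else GEMINI_BASE_PRICE
    else:
        in_price, out_price = FLAT_PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Example workload: a 50K-token prompt with a 5K-token reply.
    for name in [*FLAT_PRICES, GEMINI]:
        print(f"{name}: ${request_cost(name, 50_000, 5_000):.3f}")
```

For that example workload, the estimates come out to about $0.75 for Claude Fable 5, $0.375 for Claude Opus 4.8, $0.40 for GPT-5.5, and $0.16 for Gemini 3.1 Pro. Under the stated tier assumption, a 300K-token prompt with the same reply would cost about $1.29 on Gemini, so its advantage narrows on very long prompts.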
When to pick what
Pick Claude Opus 4.8 if you want the safest Anthropic default for everyday agentic coding and you care more about a balanced price-to-capability mix than a headline benchmark citation.
Pick GPT-5.5 if your team works in very long contexts, wants a large output budget, and can accept a benchmark signal that is strong but caveated, since the 83.4% Terminal-Bench 2.1 figure is competitor-reported.
Pick Gemini 3.1 Pro if you want the lowest cost, the most clearly published public coding scores, and a model that is easy to justify in a procurement review because the source numbers are directly visible.
Default to Gemini 3.1 Pro for cost-sensitive coding teams, but switch to GPT-5.5 when your workflows regularly need the extra-long 1,050,000-token context window or more output headroom than Gemini’s 64K cap (GPT-5.5 allows 128K).
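The hard limits behind that rule can be checked mechanically before price enters the picture. The sketch below filters the four models by context window and max output using the figures from the table. The names `ModelLimits` and `models_that_fit` are hypothetical, and two things are assumptions: "1M", "128K", and "64K" are read as round decimal numbers, and the context window is taken to hold the prompt plus the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    name: str
    context_tokens: int
    max_output_tokens: int


# ASSUMPTION: "1M" = 1,000,000, "128K" = 128,000, "64K" = 64,000 tokens.
LIMITS = [
    ModelLimits("Claude Fable 5", 1_000_000, 128_000),
    ModelLimits("Claude Opus 4.8", 1_000_000, 128_000),
    ModelLimits("GPT-5.5", 1_050_000, 128_000),
    ModelLimits("Gemini 3.1 Pro", 1_000_000, 64_000),
]


def models_that_fit(prompt_tokens: int, output_tokens: int) -> list[str]:
    """Return the models whose limits can hold this request, in table order."""
    return [
        m.name
        for m in LIMITS
        # The reply must fit under the model's output cap ...
        if output_tokens <= m.max_output_tokens
        # ... and (ASSUMPTION) prompt plus reply must fit in the context window.
        and prompt_tokens + output_tokens <= m.context_tokens
    ]
```

Under these assumptions, a request that needs a 100K-token reply rules out Gemini 3.1 Pro, and a prompt of roughly 1.02M tokens leaves only GPT-5.5. Anything that fits all four can then be decided on price and benchmark evidence.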