18 AI benchmarks now rank GPT-5.5, Claude, Gemini

OraCore Editors

[RSCH] June 17, 2026 | 3 min read

LM Council’s June 2026 benchmark hub compares 30+ models across 18 tests, with fresh scores from Epoch AI, Scale AI, and others.

LM Council’s June 2026 hub compares frontier AI models across 18 independent benchmarks.

LM Council updated its AI Model Benchmarks page on June 14, 2026, pulling together 18 tests and 30+ models from sources including Epoch AI and Scale AI. The hub tracks models such as GPT-5.5, Claude Opus, Gemini 3, and Grok 4 across reasoning, coding, math, agent-task, and visual benchmarks.

Item	Value
Benchmarks tracked	18
Models in comparison set	30+
Last updated	June 14, 2026
FrontierMath v2 release date	June 12, 2026

What changed

The page is not a single leaderboard. It is an interactive comparison tool that lets users pick two models, filter results, and inspect scores across curated datasets such as Humanity’s Last Exam, SWE-bench Verified, GPQA Diamond, FrontierMath, Terminal-Bench 2.0, and GeoBench.

Several benchmarks on the page are independently run, which means the numbers may differ from vendor-reported results. LM Council also notes that the benchmark set is curated by AI Explained, with a focus on widely watched tests rather than model-maker marketing claims.

Humanity’s Last Exam: Gemini 3.1 Pro Preview leads at 46.4% ±2.0.
SWE-bench Verified: Claude Opus 4.7 (max) tops the chart at 83.5% ±1.7.
GPQA Diamond: GPT-5.4 Pro (xhigh) leads at 94.6% ±1.6.
FrontierMath Tiers 1-3 v2: GPT-5.5 Pro (xhigh) scores 87.7% ±1.9.

Other results show a split by task type. Claude models do well on coding and terminal work, GPT-5 variants lead several math and knowledge tests, and Gemini 3 models post strong scores in visual physics and geography tasks.
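To make the ± figures on the four leading scores concrete, here is a minimal Python sketch that turns each one into a low-to-high range and adds a helper for checking whether two ranges on the same benchmark overlap. The `Score` class and `intervals_overlap` function are hypothetical names for illustration, not part of the LM Council hub. The sketch also assumes the ± value can be read as a simple symmetric range, which the page summary here does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """One benchmark result (hypothetical container for illustration)."""
    benchmark: str
    model: str
    pct: float     # reported score, in percent
    margin: float  # reported +/- value, in percentage points

    @property
    def low(self) -> float:
        return self.pct - self.margin

    @property
    def high(self) -> float:
        return self.pct + self.margin


# The four leaders exactly as listed in this article.
LEADERS = [
    Score("Humanity's Last Exam", "Gemini 3.1 Pro Preview", 46.4, 2.0),
    Score("SWE-bench Verified", "Claude Opus 4.7 (max)", 83.5, 1.7),
    Score("GPQA Diamond", "GPT-5.4 Pro (xhigh)", 94.6, 1.6),
    Score("FrontierMath Tiers 1-3 v2", "GPT-5.5 Pro (xhigh)", 87.7, 1.9),
]


def intervals_overlap(a: Score, b: Score) -> bool:
    """True when the two +/- ranges share at least one point.

    Assumption: overlap is used here as a rough signal that a lead
    is not clear-cut; it is not a formal significance test.
    """
    return a.low <= b.high and b.low <= a.high


if __name__ == "__main__":
    for s in LEADERS:
        print(f"{s.benchmark}: {s.model} {s.low:.1f}% to {s.high:.1f}%")
```

To use the helper, pass two `Score` objects for the same benchmark, with the second model's score and margin taken from the hub itself. Comparing ranges across different benchmarks would not be meaningful.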

Why it matters

For developers, the page offers a quick way to compare model fit by workload instead of relying on one headline score. A model that wins on math may lag on code fixes, while another may do better on long-context or terminal-based agent tasks.

For buyers and teams watching the market, the hub makes it easier to spot where frontier models are separating from one another and where the gaps are small enough that price, latency, or tool support may matter more than raw benchmark rank.

The main takeaway is that June 2026 model selection looks less like choosing a single “best” model and more like matching the model to the job, guided by the benchmark that most closely reflects that workload.
