18 AI benchmarks now rank GPT-5.5, Claude, Gemini
LM Council’s June 2026 benchmark hub compares 30+ models across 18 tests, with fresh scores from Epoch AI, Scale AI, and others.

LM Council updated its AI Model Benchmarks page on June 14, 2026, pulling together 18 tests and 30+ models from sources including Epoch AI and Scale AI. The list tracks names such as GPT-5.5, Claude Opus, Gemini 3, and Grok 4 across reasoning, coding, math, agent tasks, and visual benchmarks.
| Item | Value |
|---|---|
| Benchmarks tracked | 18 |
| Models in comparison set | 30+ |
| Last updated | June 14, 2026 |
| FrontierMath v2 release date | June 12, 2026 |
What changed
The page is not a single leaderboard. It is an interactive comparison tool that lets users pick two models, filter results, and inspect scores across curated datasets such as Humanity’s Last Exam, SWE-bench Verified, GPQA Diamond, FrontierMath, Terminal-Bench 2.0, and GeoBench.

Several benchmarks on the page are independently run, which means the numbers may differ from vendor-reported results. LM Council also notes that the benchmark set is curated by AI Explained, with a focus on widely watched tests rather than model-maker marketing claims.
- Humanity’s Last Exam: Gemini 3.1 Pro Preview leads at 46.4% ±2.0.
- SWE-bench Verified: Claude Opus 4.7 (max) tops the chart at 83.5% ±1.7.
- GPQA Diamond: GPT-5.4 Pro (xhigh) leads at 94.6% ±1.6.
- FrontierMath Tiers 1-3 v2: GPT-5.5 Pro (xhigh) scores 87.7% ±1.9.
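The ± figures above are easier to compare as ranges. The following is a minimal sketch, assuming each ± value is a symmetric margin in percentage points around the score; the page does not say whether these are standard errors or confidence intervals. The `Score` class and its `interval` and `overlaps` helpers are hypothetical names used only for illustration, not part of any LM Council tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Score:
    """One benchmark result; `margin` is the ± figure (assumed symmetric)."""
    benchmark: str
    model: str
    value: float   # score in percent
    margin: float  # ± figure in percentage points

    def interval(self) -> Tuple[float, float]:
        # Low and high ends of the range implied by value ± margin.
        return (round(self.value - self.margin, 1),
                round(self.value + self.margin, 1))

    def overlaps(self, other: "Score") -> bool:
        # True when the two ranges share at least one point, i.e. the gap
        # between the scores is within the stated margins.
        lo_a, hi_a = self.interval()
        lo_b, hi_b = other.interval()
        return lo_a <= hi_b and lo_b <= hi_a


# The four leading results quoted in the article.
LEADERS = [
    Score("Humanity's Last Exam", "Gemini 3.1 Pro Preview", 46.4, 2.0),
    Score("SWE-bench Verified", "Claude Opus 4.7 (max)", 83.5, 1.7),
    Score("GPQA Diamond", "GPT-5.4 Pro (xhigh)", 94.6, 1.6),
    Score("FrontierMath Tiers 1-3 v2", "GPT-5.5 Pro (xhigh)", 87.7, 1.9),
]

for s in LEADERS:
    lo, hi = s.interval()
    print(f"{s.benchmark}: {s.model} {lo:.1f}% to {hi:.1f}%")
```

Read this way, 46.4% ±2.0 on Humanity's Last Exam spans 44.4% to 48.4%. Calling `overlaps` on two results from the same benchmark gives a rough check on whether a lead is larger than the stated margins; it is not a formal significance test.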
Other results show a split by task type. Claude models do well on coding and terminal work, GPT-5 variants lead several math and knowledge tests, and Gemini 3 models post strong scores in visual physics and geography tasks.
Why it matters
For developers, the page offers a quick way to compare model fit by workload instead of relying on one headline score. A model that wins on math may lag on code fixes, while another may do better on long-context or terminal-based agent tasks.

For buyers and teams watching the market, the hub makes it easier to spot where frontier models are separating from one another and where the gaps are small enough that price, latency, or tool support may matter more than raw benchmark rank.
The main takeaway is that June 2026 model selection looks less like choosing a single “best” model and more like matching the model to the job, using the benchmark closest to that workload.