Tag
AI benchmarks
AI benchmarks measure how models perform on reasoning, knowledge QA, coding, and long-context tasks. Scores from tests like ARC-AGI-2, GPQA, and MMLU help compare new releases, track real progress, and expose trade-offs between capability, cost, and reliability.
3 articles

Research/Apr 17
Stanford’s 2026 AI Index, explained with charts
Stanford’s 2026 AI Index shows faster adoption, rising costs, and narrow US-China gaps. The charts tell a messier story than the hype.

Model Releases/Apr 3
Gemini 3.1 Pro: Google’s new top model in numbers
Gemini 3.1 Pro posts 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond, and offers a 1M-token context window, while keeping Gemini 3 pricing.

Model Releases/Apr 2
GPT-5.4 vs Claude Opus 4.6: 75% Win Rate
We tested GPT-5.4, Claude Opus 4.6, DeepSeek V4, and Gemini 3.1 across 12 benchmarks. One model won 9 of them.