SWE-Bench
SWE-bench is a benchmark for measuring whether models and coding agents can fix real GitHub issues end to end. Its variants, including Verified and Lite, are used to compare bug localization, test-driven edits, and the cost of agentic repair workflows.
8 articles

Kimi K2.6 and Qwen 3.6 Narrow the Gap
Kimi K2.6 and Qwen 3.6 are open-weight models that now rival closed models on coding and agent tasks.

AI Coding Agents Burn 1000x More Tokens Than Chat
A study of SWE-bench Verified shows agentic coding can consume 1000x more tokens than chat, with costs dominated by input tokens and hard to predict in advance.

Qwen3.6-27B opens a smaller, sharper path to coding
Qwen3.6-27B is a 27B dense multimodal model that beats Qwen3.5-397B-A17B on key coding benchmarks while being far easier to deploy.

Claude Mythos Preview Tops GPT-5.4 on Key Benchmarks
Anthropic’s unreleased Mythos Preview beats GPT-5.4 and Gemini 3.1 Pro on coding, math, and agent tests, led by 97.6% on USAMO.

I Tested Devin on 10 Tasks. It Finished 3.
Devin scored 13.86% on SWE-bench and completed only 3 of 10 real-world tasks in a hands-on test, showing where AI coding agents still fall short.

Gemini 3.1 Pro: Google’s new top model in numbers
Gemini 3.1 Pro posts 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond, and offers a 1M-token context window, while keeping Gemini 3 pricing.

GLM-5: Z.AI’s new flagship for coding and agents
GLM-5 posts 77.8% on SWE-bench Verified and 56.2% on Terminal Bench 2.0, putting Z.AI in direct competition with top coding models.

Xiaomi MiMo-V2-Pro: 1T MoE Model for Agents
Xiaomi’s MiMo-V2-Pro packs 1T total parameters, 42B active, and a 1M-token context window, with SWE-bench results close to Claude Sonnet 4.6.