SWE-Bench
SWE-bench is a benchmark for measuring whether models and coding agents can fix real GitHub issues end to end. Its variants, including Verified and Lite, are used to compare bug localization, test-driven edits, and the cost of agentic repair workflows.
8 articles

Kimi K2.6 and Qwen 3.6 Narrow the Gap
Kimi K2.6 and Qwen 3.6 are open-weight models that now rival closed models on coding and agent tasks.

AI Coding Agents Burn 1000x More Tokens Than Chat
A study of SWE-bench Verified shows agentic coding can consume 1000x more tokens than chat, with costs dominated by input tokens and hard to predict in advance.

Qwen3.6-27B opens a smaller, sharper path to coding
Qwen3.6-27B is a 27B dense multimodal model that beats Qwen3.5-397B-A17B on key coding benchmarks while being far easier to deploy.

Claude Mythos Preview Tops GPT-5.4 on Key Benchmarks
Anthropic’s unreleased Mythos Preview beats GPT-5.4 and Gemini 3.1 Pro on coding, math, and agent tests, led by 97.6% on USAMO.

I Tested Devin on 10 Tasks. It Finished 3.
Devin scored 13.86% on SWE-bench and completed only 3 of 10 real-world tasks in a hands-on test, showing where AI coding agents still fall short.

Gemini 3.1 Pro: Google’s new top model in numbers
Gemini 3.1 Pro posts 77.1% on ARC-AGI-2 and 94.3% on GPQA Diamond, and offers a 1M-token context window, while keeping Gemini 3 pricing.

GLM-5: Z.AI’s new flagship for coding and agents
GLM-5 posts 77.8% on SWE-bench Verified and 56.2% on Terminal Bench 2.0, putting Z.AI in direct competition with top coding models.

Xiaomi MiMo-V2-Pro: 1T MoE Model for Agents
Xiaomi’s MiMo-V2-Pro packs 1T total parameters, 42B active, and a 1M-token context window, with SWE-bench results close to Claude Sonnet 4.6.