Cursor Composer 2 Bets on Agentic Coding

OraCore Editors

[MODEL] March 28, 2026 · 9 min read

Cursor’s Composer 2 posts 61.3 on CursorBench and 61.7 on Terminal-Bench 2.0, with pricing aimed at high-volume coding teams.

Cursor says its new Composer 2 model hit 61.3 on CursorBench and 61.7 on Terminal-Bench 2.0, which is a serious jump for a coding model that is meant to edit files, run tests, and keep going. The bigger story is simple: AI coding tools are moving from autocomplete into agents that can plan work and finish it inside the IDE.

That matters because the people buying these tools do not care about chatbot flair. They care about pull requests closed per week, fewer context switches, and whether the model can survive a real repo with flaky tests and ugly legacy code. Composer 2 is Cursor’s answer to that pressure.

What Cursor actually launched

Cursor, the AI coding company under Anysphere, introduced Composer 2 as its newest agentic coding model on March 19, 2026. The pitch is straightforward: this model is built for developer workflows, not general chat. It can inspect code, edit multiple files, call tools, and keep working across longer tasks.

The launch matters because Cursor already has a large audience inside its editor, so the company can test how well a model behaves in production-like coding sessions instead of only in benchmark suites. That gives Composer 2 a built-in feedback loop that most model makers do not have.

Here are the headline details Cursor published:

CursorBench score: 61.3
Terminal-Bench 2.0 score: 61.7
SWE-bench Multilingual: 73.7
Standard pricing: $0.50 per 1 million input tokens and $2.50 per 1 million output tokens
Fast variant: higher throughput at 5x the price

Those numbers place Composer 2 in a very specific category. It is not trying to be a general-purpose assistant that writes poems, drafts emails, and answers trivia. It is trying to be a coding worker that can stay inside a task until the repo is in better shape.

Why the architecture matters

Cursor says Composer 2 keeps the mixture-of-experts design used in earlier versions, then adds more training on long-horizon coding tasks. That means the model does not activate every parameter for every token. Instead, it routes work through a smaller set of experts, which can reduce compute cost and keep responses fast.
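To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. Cursor has not published Composer 2's expert count, its top-k value, or how its router works, so the eight toy experts, the router scores, and the function names below are all hypothetical. The only point is to show why most parameters stay idle for any given token.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(token, experts, router_scores, top_k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_scores, top_k)
    # Experts that were not chosen are never called for this token.
    return sum(weight * experts[i](token) for i, weight in gates.items())


# Hypothetical setup: eight toy "experts" that each scale the input differently.
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]

gates = route_token(router_scores, top_k=2)
output = moe_layer(1.0, experts, router_scores, top_k=2)
active_fraction = len(gates) / len(experts)

print(f"experts used: {sorted(gates)}")
print(f"share of experts active for this token: {active_fraction:.0%}")
```

In this toy setup only two of eight experts run per token, which is the source of the compute savings described above. Production routers also deal with load balancing and batching, which this sketch leaves out.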

That design choice is important because agentic coding is expensive. A model has to read files, reason about dependencies, write patches, inspect logs, and sometimes retry when tests fail. If each step is slow, the whole workflow becomes annoying. If the model is fast enough, it feels like a junior engineer working in the background.

Cursor also says the model was trained on sandboxed coding environments with reinforcement learning. Sasha Rush, who works on the model, described the approach in public discussion as using RL to align expert routers with real developer workflows. In plain English: the model was taught to behave more like a coding agent and less like a generic text generator.

Mixture-of-experts reduces the active compute per token
Sandbox training teaches tool use, file edits, and test execution
Long-horizon tasks reward agents that keep state across steps
IDE integration gives the model direct access to terminals and worktrees

This is where Composer 2 starts to separate itself from models that only look strong in chat demos. Coding agents fail in boring ways: they edit the wrong file, forget a previous instruction, or stop halfway through a refactor. A model trained around actual repo work has a better shot at avoiding those mistakes.

Benchmarks, pricing, and the real comparison

Cursor’s launch post makes a clear argument: Composer 2 is fast enough and cheap enough to compete with larger frontier models on coding tasks. The company says the model improved 38 percent over Composer 1.5 on CursorBench, while posting a 61.7 score on Terminal-Bench 2.0. For teams paying by token, the pricing is the other half of the story.

At $0.50 per 1 million input tokens and $2.50 per 1 million output tokens, Composer 2 is positioned below many frontier API prices that developers already use for coding help. Cursor also offers a fast variant that triples throughput, but at 5x the cost. That is a useful option for teams that care more about turnaround time than raw token efficiency.

Here is the practical comparison buyers will make:

Composer 2 standard: $0.50 input, $2.50 output per 1 million tokens
Composer 2 fast: 3x throughput, 5x price
Composer 1.5: lower benchmark scores, less capable on long tasks
GPT-5 and Claude Opus class models: often stronger general reasoning, usually higher cost for coding workflows
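A quick way to reason about the standard versus fast trade-off is to price a single agent session under both tiers. The sketch below uses the $0.50 and $2.50 rates and the 3x throughput, 5x price figures quoted in this article, and assumes the rates are billed per million tokens (BILLING_UNIT is a parameter, so adjust it if your invoice says otherwise). The session size and the baseline generation speed are hypothetical placeholders, not published numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_price: float     # USD per billing unit of input tokens
    output_price: float    # USD per billing unit of output tokens
    relative_speed: float  # throughput relative to the standard tier


# Assumption: the quoted rates are per million tokens.
BILLING_UNIT = 1_000_000

STANDARD = Tier("standard", 0.50, 2.50, 1.0)
# Fast tier as described in the article: 5x the price, 3x the throughput.
FAST = Tier("fast", 0.50 * 5, 2.50 * 5, 3.0)


def session_cost(tier, input_tokens, output_tokens, unit=BILLING_UNIT):
    """Token spend in USD for one agent session."""
    spend = input_tokens * tier.input_price + output_tokens * tier.output_price
    return spend / unit


def generation_minutes(tier, output_tokens, base_tokens_per_second=100.0):
    """Rough generation time; base_tokens_per_second is a made-up baseline."""
    return output_tokens / (base_tokens_per_second * tier.relative_speed) / 60


# Hypothetical session: the agent reads 2M tokens of context and writes 200k.
for tier in (STANDARD, FAST):
    cost = session_cost(tier, 2_000_000, 200_000)
    minutes = generation_minutes(tier, 200_000)
    print(f"{tier.name}: ${cost:.2f}, roughly {minutes:.1f} min of generation")
```

Under these assumptions the fast tier buys a third of the generation time for five times the spend, so it only pays off when an engineer is actually blocked waiting on the result.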

The catch is that benchmark numbers do not tell the whole story. Cursor has not published every raw run, seed, or hardware detail for these results. That does not make the scores meaningless, but it does mean independent replication will matter before anyone treats Composer 2 as the default choice for serious production work.

“The model is only as good as the workflow around it.” — Andrej Karpathy, X post, 2023

Karpathy’s line fits this launch well. A model can score well and still frustrate developers if the surrounding product gets in the way. Cursor’s advantage is that it owns the editor, the terminal integration, and the agent loop, so it can shape the full experience instead of just selling API access.

Why enterprises are paying attention

Cursor is already inside large engineering organizations, and that gives Composer 2 an audience that cares about measurable output. Tom’s Hardware reported that NVIDIA has more than 30,000 internal Cursor seats, and NVIDIA has said its code output has tripled versus pre-AI baselines. That is the kind of adoption number that makes procurement teams look twice.

Enterprise interest comes from a few practical features. Cursor supports audit logs, sandboxed terminals, isolated worktrees, and commit signing. Those controls matter in regulated environments where teams need to know what the agent touched and how changes reached the repo.

For teams comparing agentic coding tools, the real question is how much work the model can complete without human intervention. Cursor’s setup is designed for parallelism, so one agent can refactor code while another writes tests and a third updates docs. That is much more useful than a single chat window that spits out a patch and waits for the user to do the rest.

30,000+ Cursor seats at NVIDIA, per Tom’s Hardware
Triple code output claim versus pre-AI baselines
Audit logs for review and incident tracing
Sandboxed execution to contain risky actions

Still, enterprise adoption will depend on whether Composer 2 can handle the messier parts of software work: flaky CI, partial migrations, and repos where half the logic lives in side effects. A strong demo helps, but a stable month in production matters more.

The verification gap is still the story

Cursor’s launch is impressive, but the missing pieces are just as important. The company has not published enough detail for outsiders to fully reproduce its Terminal-Bench 2.0 runs on the Harbor evaluation harness or compare wall-clock performance across identical hardware. That leaves room for skepticism, especially because coding benchmarks often reward narrow optimizations.

There is also a difference between a model that scores well and a model that saves time. Developers care about how many tokens it burns, how often it retries, how long it takes to finish a fix, and whether the result passes tests on the first or second attempt. Those are the numbers that decide whether a team keeps paying for the tool.

Independent labs will likely test Composer 2 against OpenAI systems, Anthropic models, and Google DeepMind offerings over the next few weeks. That comparison will matter more than the launch thread, because it will show whether Cursor’s model is strong only inside its own editor or strong in the broader coding market too.

For now, the smartest move for teams is to run a small pilot on a non-critical repository. Measure completion rate, test pass rate, latency, and token spend. If Composer 2 really saves time on multi-file refactors and debugging, the numbers will show up quickly. If it does not, the gap between benchmark claims and day-to-day value will be obvious just as fast.
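The pilot metrics listed above are easy to compute once each agent task is logged as a record. Here is a minimal sketch, assuming you capture five fields per task. The TaskRun fields, the summarize helper, and the sample values are all hypothetical and do not come from any Cursor export format.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskRun:
    completed: bool     # agent finished without a human taking over
    tests_passed: bool  # test suite was green on the agent's final patch
    merged: bool        # the change was reviewed and merged
    latency_s: float    # wall-clock seconds from start to final patch
    cost_usd: float     # token spend attributed to this task


def summarize(runs):
    """Aggregate pilot metrics across a list of TaskRun records."""
    if not runs:
        raise ValueError("need at least one task run")
    completed = [r for r in runs if r.completed]
    merged = sum(r.merged for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "completion_rate": len(completed) / len(runs),
        # Pass rate is measured over completed tasks only.
        "test_pass_rate": (
            sum(r.tests_passed for r in completed) / len(completed)
            if completed else 0.0
        ),
        "median_latency_s": median(r.latency_s for r in runs),
        "total_cost_usd": total_cost,
        # None when nothing merged, to avoid dividing by zero.
        "cost_per_merged_change": total_cost / merged if merged else None,
    }


# Made-up sample data for illustration only.
runs = [
    TaskRun(True, True, True, 240.0, 1.20),
    TaskRun(True, False, False, 610.0, 2.10),
    TaskRun(False, False, False, 900.0, 3.00),
    TaskRun(True, True, True, 180.0, 0.70),
]

for name, value in summarize(runs).items():
    print(f"{name}: {value}")
```

Failed and abandoned runs still count toward total spend here, which is deliberate: the cost of a merged change should include the attempts that went nowhere.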

What to watch next

Composer 2 feels like a serious bet on agentic coding, and Cursor has enough product control to make that bet matter. The model’s pricing, benchmark scores, and IDE integration all point in the same direction: coding assistants are becoming execution tools, not just suggestion engines.

My read is that the next phase will not be about whether these agents can write code at all. It will be about how much of a task they can finish before a human has to step in. If Cursor can publish stronger transparency data and third-party tests confirm the gains, Composer 2 could become a default option for teams that live in large codebases.

Until then, the right question is simple: does Composer 2 save your team enough time to justify replacing the model you already trust? Run the pilot, inspect the logs, and compare the real cost per merged change. That is the only benchmark that counts.
