Cursor Composer 2 Bets on Agentic Coding
Cursor’s Composer 2 posts 61.3 on CursorBench and 61.7 on Terminal-Bench 2.0, with pricing aimed at high-volume coding teams.

Cursor says its new Composer 2 model hit 61.3 on CursorBench, which the company describes as a 38 percent improvement over Composer 1.5, and 61.7 on Terminal-Bench 2.0. Those are strong results for a coding model that is meant to edit files, run tests, and keep going. The bigger story is simple: AI coding tools are moving from autocomplete to agents that can plan work and finish it inside the IDE.
That matters because the people buying these tools do not care about chatbot flair. They care about pull requests closed per week, fewer context switches, and whether the model can survive a real repo with flaky tests and ugly legacy code. Composer 2 is Cursor’s answer to that pressure.
What Cursor actually launched
Cursor, the AI coding company under Anysphere, introduced Composer 2 as its newest agentic coding model on March 19, 2026. The pitch is straightforward: this model is built for developer workflows, not general chat. It can inspect code, edit multiple files, call tools, and keep working across longer tasks.
The launch matters because Cursor already has a large audience inside its editor, so the company can test how well a model behaves in production-like coding sessions instead of only in benchmark suites. That gives Composer 2 a built-in feedback loop that most model makers do not have.
Here are the headline details Cursor published:
- CursorBench score: 61.3
- Terminal-Bench 2.0 score: 61.7
- SWE-bench Multilingual: 73.7
- Standard pricing: $0.50 per million input tokens and $2.50 per million output tokens
- Fast variant: higher throughput at 5x the price
Those numbers place Composer 2 in a very specific category. It is not trying to be a general-purpose assistant that writes poems, drafts emails, and answers trivia. It is trying to be a coding worker that can stay inside a task until the repo is in better shape.
Why the architecture matters
Cursor says Composer 2 keeps the mixture-of-experts design used in earlier versions, then adds more training on long-horizon coding tasks. That means the model does not activate every parameter for every token. Instead, it routes work through a smaller set of experts, which can reduce compute cost and keep responses fast.
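The routing idea can be shown with a toy example. This is a minimal sketch of generic top-k expert routing, not Cursor's implementation: the function names, the scalar "experts", and the router scores are all hypothetical, and real systems route per token inside each layer using learned router weights.

```python
import math

def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs."""
    chosen = route_top_k(router_scores, k)
    output = sum(weight * experts[i](x) for i, weight in chosen)
    return output, [i for i, _ in chosen]

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical router scores for one token.
output, active = moe_layer(1.0, experts, router_scores=[0.1, 2.0, 0.3, 1.5], k=2)
print(active)  # experts 1 and 3 run; experts 0 and 2 are skipped
```

With four experts and k=2, half of the expert computation is skipped for this token, which is where the compute saving comes from.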
That design choice is important because agentic coding is expensive. A model has to read files, reason about dependencies, write patches, inspect logs, and sometimes retry when tests fail. If each step is slow, the whole workflow becomes annoying. If the model is fast enough, it feels like a junior engineer working in the background.
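The loop described above can be sketched in a few lines. This is an illustrative outline only, assuming two hypothetical callables supplied by the caller (`propose_patch` and `apply_and_test`); it does not reflect how Cursor's agent is actually built.

```python
from typing import Callable, List, Optional, Tuple

def run_until_green(
    propose_patch: Callable[[List[str]], str],
    apply_and_test: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Propose a patch, run the tests, feed failures back, and retry.

    propose_patch receives the failure logs seen so far and returns a patch.
    apply_and_test returns None when tests pass, or an error log otherwise.
    """
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(failures)
        error = apply_and_test(patch)
        if error is None:
            return patch, attempt      # tests are green, stop here
        failures.append(error)         # keep the log for the next attempt
    return None, max_attempts          # give up and hand back to a human

# Hypothetical stubs standing in for the model and the test runner.
def fake_model(failures: List[str]) -> str:
    return f"patch-v{len(failures) + 1}"

def fake_test_runner(patch: str) -> Optional[str]:
    return None if patch == "patch-v3" else f"{patch} failed"

result = run_until_green(fake_model, fake_test_runner)
print(result)
```

Every pass through the loop costs a model call plus a test run, so per-step latency compounds quickly on tasks that need several retries.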
Cursor also says the model was trained on sandboxed coding environments with reinforcement learning. Sasha Rush, who works on the model, described the approach in public discussion as using RL to align expert routers with real developer workflows. In plain English: the model was taught to behave more like a coding agent and less like a generic text generator.
- Mixture-of-experts reduces the active compute per token
- Sandbox training teaches tool use, file edits, and test execution
- Long-horizon tasks reward agents that keep state across steps
- IDE integration gives the model direct access to terminals and worktrees
This is where Composer 2 starts to separate itself from models that only look strong in chat demos. Coding agents fail in boring ways: they edit the wrong file, forget a previous instruction, or stop halfway through a refactor. A model trained around actual repo work has a better shot at avoiding those mistakes.
Benchmarks, pricing, and the real comparison
Cursor’s launch post makes a clear argument: Composer 2 is fast enough and cheap enough to compete with larger frontier models on coding tasks. The company says the model improved 38 percent over Composer 1.5 on CursorBench, while posting a 61.7 score on Terminal-Bench 2.0. For teams paying by token, the pricing is the other half of the story.
At $0.50 per million input tokens and $2.50 per million output tokens, Composer 2 is positioned below many frontier API prices that developers already pay for coding help. Cursor also offers a fast variant that triples throughput, but at 5x the cost. That is a useful option for teams that care more about turnaround time than raw token efficiency.
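The fast-variant trade-off is easy to work through with rough numbers. The sketch below assumes that task time shrinks in proportion to throughput and that cost grows in proportion to price, which ignores time spent running tests and tools. The function name and the sample task cost and duration are hypothetical.

```python
def fast_variant_tradeoff(standard_cost_usd, standard_minutes,
                          throughput_multiplier=3.0, price_multiplier=5.0):
    """Compare one task on the standard and fast variants.

    Returns the extra spend, the minutes saved, and the hourly value of
    waiting time at which the fast variant breaks even.
    """
    fast_cost = standard_cost_usd * price_multiplier
    fast_minutes = standard_minutes / throughput_multiplier
    extra_cost = fast_cost - standard_cost_usd
    minutes_saved = standard_minutes - fast_minutes
    return {
        "extra_cost_usd": extra_cost,
        "minutes_saved": minutes_saved,
        "break_even_usd_per_hour": extra_cost / (minutes_saved / 60),
    }

# Hypothetical task: $0.40 and 12 minutes on the standard variant.
tradeoff = fast_variant_tradeoff(standard_cost_usd=0.40, standard_minutes=12)
print(tradeoff)
```

For this made-up task the fast variant costs about $1.60 more and saves 8 minutes, so it pays off once an hour of waiting is worth more than roughly $12 to the team. The break-even shifts if a large share of the task is test or tool time that the faster model cannot shorten.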
Here is the practical comparison buyers will make:
- Composer 2 standard: $0.50 input, $2.50 output per million tokens
- Composer 2 fast: 3x throughput, 5x price
- Composer 1.5: lower benchmark scores, less capable on long tasks
- GPT-5 and Claude Opus class models: often stronger general reasoning, usually higher cost for coding workflows
The catch is that benchmark numbers do not tell the whole story. Cursor has not published every raw run, seed, or hardware detail for these results. That does not make the scores meaningless, but it does mean independent replication will matter before anyone treats Composer 2 as the default choice for serious production work.
“The model is only as good as the workflow around it.” — Andrej Karpathy, X post, 2023
Karpathy’s line fits this launch well. A model can score well and still frustrate developers if the surrounding product gets in the way. Cursor’s advantage is that it owns the editor, the terminal integration, and the agent loop, so it can shape the full experience instead of just selling API access.
Why enterprises are paying attention
Cursor is already inside large engineering organizations, and that gives Composer 2 an audience that cares about measurable output. Tom’s Hardware reported that NVIDIA has more than 30,000 internal Cursor seats, and NVIDIA has said its code output has tripled versus pre-AI baselines. That is the kind of adoption number that makes procurement teams look twice.
Enterprise interest comes from a few practical features. Cursor supports audit logs, sandboxed terminals, isolated worktrees, and commit signing. Those controls matter in regulated environments where teams need to know what the agent touched and how changes reached the repo.
For teams comparing agentic coding tools, the real question is how much work the model can complete without human intervention. Cursor’s setup is designed for parallelism, so one agent can refactor code while another writes tests and a third updates docs. That is much more useful than a single chat window that spits out a patch and waits for the user to do the rest.
- 30,000+ Cursor seats at NVIDIA, per Tom’s Hardware
- Triple code output claim versus pre-AI baselines
- Audit logs for review and incident tracing
- Sandboxed execution to contain risky actions
Still, enterprise adoption will depend on whether Composer 2 can handle the messier parts of software work: flaky CI, partial migrations, and repos where half the logic lives in side effects. A strong demo helps, but a stable month in production matters more.
The verification gap is still the story
Cursor’s launch is impressive, but the missing pieces are just as important. The company has not published enough detail for outsiders to fully reproduce its runs on Harbor, the evaluation harness used for Terminal-Bench 2.0, or to compare wall-clock performance across identical hardware. That leaves room for skepticism, especially because coding benchmarks often reward narrow optimizations.
There is also a difference between a model that scores well and a model that saves time. Developers care about how many tokens it burns, how often it retries, how long it takes to finish a fix, and whether the result passes tests on the first or second attempt. Those are the numbers that decide whether a team keeps paying for the tool.
Independent labs will likely test Composer 2 against OpenAI systems, Anthropic models, and Google DeepMind offerings over the next few weeks. That comparison will matter more than the launch thread, because it will show whether Cursor’s model is strong only inside its own editor or strong in the broader coding market too.
For now, the smartest move for teams is to run a small pilot on a non-critical repository. Measure completion rate, test pass rate, latency, and token spend. If Composer 2 really saves time on multi-file refactors and debugging, the numbers will show up quickly. If it does not, the gap between benchmark claims and day-to-day value will be obvious just as fast.
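A pilot like this needs little more than a log of runs and a summary function. The sketch below is one possible shape, not a prescribed method: the `PilotRun` fields and the sample data are hypothetical, and it assumes the team records the billed cost of each run instead of recomputing it from token prices.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class PilotRun:
    task_id: str
    completed: bool      # agent finished without a human taking over
    tests_passed: bool   # test suite was green on the agent's final change
    merged: bool         # change was reviewed and merged
    latency_s: float     # wall-clock time for the run
    tokens: int          # total tokens consumed
    cost_usd: float      # billed cost of the run

def summarize(runs):
    """Aggregate the pilot metrics across all recorded runs."""
    if not runs:
        raise ValueError("no pilot runs recorded")
    n = len(runs)
    merged = sum(r.merged for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "test_pass_rate": sum(r.tests_passed for r in runs) / n,
        "median_latency_s": median(r.latency_s for r in runs),
        "total_tokens": sum(r.tokens for r in runs),
        "cost_per_merged_change_usd": total_cost / merged if merged else None,
    }

# Hypothetical pilot data for four tasks.
runs = [
    PilotRun("t1", True, True, True, 120.0, 40_000, 0.30),
    PilotRun("t2", True, False, False, 300.0, 90_000, 0.70),
    PilotRun("t3", False, False, False, 600.0, 150_000, 1.20),
    PilotRun("t4", True, True, True, 180.0, 50_000, 0.40),
]
summary = summarize(runs)
print(summary)
```

Dividing total spend by merged changes, instead of by attempted tasks, charges the cost of failed and abandoned runs to the work that actually shipped.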
What to watch next
Composer 2 feels like a serious bet on agentic coding, and Cursor has enough product control to make that bet matter. The model’s pricing, benchmark scores, and IDE integration all point in the same direction: coding assistants are becoming execution tools, not just suggestion engines.
My read is that the next phase will not be about whether these agents can write code at all. It will be about how much of a task they can finish before a human has to step in. If Cursor publishes more detailed evaluation data and third-party tests confirm the gains, Composer 2 could become a default option for teams that live in large codebases.
Until then, the right question is simple: does Composer 2 save your team enough time to justify replacing the model you already trust? Run the pilot, inspect the logs, and compare the real cost per merged change. That is the only benchmark that counts.