Why browser exploit benchmarks prove AI security is already here
Claude Mythos and GPT-5.5 show that autonomous browser exploitation is now a practical AI capability, not a future threat.

Autonomous browser exploit development is no longer a lab curiosity, and the Carnegie Mellon benchmark proves it.
That is the real significance of ExploitBench: it does not ask whether a model can spot a bug in a toy setting, it measures whether the model can push a real V8 vulnerability all the way to code execution. Claude Mythos Preview reached the top tier on 21 of 41 vulnerabilities and averaged 9.90 out of 16, while GPT-5.5 landed far behind at 5.51. In fully autonomous mode, Mythos barely slipped, scoring 9.55, which means the model was not just assisted by clever prompt scaffolding. It was doing meaningful exploit work on its own.
First, the benchmark measures the right thing
Most AI security demos stop at detection, classification, or a proof of concept that never leaves the notebook. That is useful, but it is not the threat model that matters. ExploitBench scores progress across five tiers, up to arbitrary code execution on the target system. That is the line defenders actually care about, because once an attacker can run commands, the browser is no longer a sandbox in any practical sense.

The choice of V8 makes the result harder to dismiss. V8 powers Chrome, Edge, Node.js, and Cloudflare Workers, which means this is not a niche engine in an obscure product. When a benchmark built around V8 says a model can drive a vulnerability to full execution, the implication reaches far beyond one browser tab. It points to a class of agent behavior that maps directly onto real-world attack surfaces used at internet scale every day.
Second, the performance gap is already operationally meaningful
Anthropic’s Claude Mythos Preview did not merely edge out GPT-5.5. It crushed it. Mythos hit the highest tier on 21 vulnerabilities; GPT-5.5 managed two. The autonomous gap was just as stark: 9.55 versus 4.30. That is not a rounding error, and it is not the kind of difference that disappears with a slightly better prompt. It is the difference between an agent that can function like a competent researcher and one that mostly stalls out before the finish line.
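To put the autonomous gap in proportion, here is a minimal sketch of the arithmetic using only the averages quoted in this article. The "default" label is an assumption: it stands for the 9.90 and 5.51 averages reported before the fully autonomous figures, and the helper name `relative_drop` is illustrative rather than part of ExploitBench.

```python
# Average ExploitBench scores (out of 16) as quoted in this article.
# "default" is an assumed label for the non-autonomous averages.
scores = {
    "Claude Mythos Preview": {"default": 9.90, "autonomous": 9.55},
    "GPT-5.5": {"default": 5.51, "autonomous": 4.30},
}


def relative_drop(default: float, autonomous: float) -> float:
    """Fraction of the default score lost in fully autonomous mode."""
    return (default - autonomous) / default


drops = {model: relative_drop(**s) for model, s in scores.items()}

for model, drop in drops.items():
    print(f"{model}: {drop:.1%} lower when fully autonomous")
```

On these figures Mythos loses roughly 3.5 percent of its score without scaffolding, while GPT-5.5 loses roughly 22 percent, which is why the autonomous result reads as a difference in kind rather than degree.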
Seunghyun Lee, one of the benchmark’s co-authors and an experienced browser vulnerability researcher, reviewed Mythos transcripts and concluded the model behaved like a “fairly competent browser / JS engine security researcher.” That is a serious statement. He also noted cases where Mythos rediscovered a vulnerability technique that humans had dismissed as too complex, and reproduced CVE-2024-0519 after human researchers had failed to exploit it for over a year. Those examples matter because they show the model is not only following recipes. It is navigating hard exploit paths that demand patience, inference, and persistence.
Third, the constraint is cost, not capability
Mythos is impressive, but the cost profile is brutal. The full Mythos run cost about $36,428 across 122 episodes, while GPT-5.5 via Codex cost roughly $3,075 across 123 episodes. That is about a twelvefold gap. In practice, this means the frontier is not just about who can do the work, but who can afford to do it at scale. Security teams should not comfort themselves with the idea that expensive agents are harmless. Attackers do not need every run to be cheap if one successful run yields a working exploit chain.
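The "about twelvefold" figure can be checked with a short calculation. This is a sketch based only on the totals and episode counts quoted above; the dictionary layout and variable names are illustrative, and the dollar amounts are the article's approximations.

```python
# Run totals as quoted in this article (approximate USD figures).
runs = {
    "Claude Mythos Preview": {"cost_usd": 36_428, "episodes": 122},
    "GPT-5.5 via Codex": {"cost_usd": 3_075, "episodes": 123},
}

# Average spend per episode for each model.
per_episode = {name: r["cost_usd"] / r["episodes"] for name, r in runs.items()}

mythos = "Claude Mythos Preview"
gpt = "GPT-5.5 via Codex"

# Ratio of total spend, and ratio of per-episode spend.
total_ratio = runs[mythos]["cost_usd"] / runs[gpt]["cost_usd"]
per_episode_ratio = per_episode[mythos] / per_episode[gpt]

for name, cost in per_episode.items():
    print(f"{name}: ${cost:,.2f} per episode")
print(f"Total cost ratio: {total_ratio:.1f}x")
print(f"Per-episode cost ratio: {per_episode_ratio:.1f}x")
```

That works out to roughly $299 per Mythos episode against $25 per GPT-5.5 episode, so the gap is close to twelvefold whether measured on totals or per episode.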

The uncomfortable lesson is that cost curves are not a safety moat. The article notes that OpenAI could narrow the gap by throwing more compute at the problem, and that is exactly the direction this field moves in. As models get cheaper and more efficient, the same workflow that now costs tens of thousands of dollars will become accessible to more users, more often. The benchmark is not just a performance ranking. It is a preview of how quickly autonomous exploitation can move from elite research to routine tooling.
The counter-argument
The strongest objection is that these are publicly known vulnerabilities, not novel zero-days, and the benchmark does not prove models can independently discover fresh flaws in the wild. That is true. The dataset includes bugs that may appear in training data, and the authors explicitly say the benchmark does not yet measure finding new flaws or fully weaponizing an exploit for real attacks. On that narrow point, the critics are right to resist overclaiming.
But that limitation does not weaken the alarm. A system that can reliably turn known browser vulnerabilities into code execution is already dangerous, because most real intrusion chains still depend on exploiting known issues faster, more consistently, and at greater scale than defenders can patch. The benchmark is not claiming the last mile of autonomous cyber offense. It is proving that the middle of the attack chain, the part that turns vulnerability knowledge into working exploitation, is now within reach of advanced models.
What to do with this
If you are an engineer, PM, or founder, treat agentic security work as a production risk and a defensive opportunity. Harden browser and JavaScript engine update paths, reduce exploit blast radius, and assume that future attackers will use AI to automate the tedious middle of the chain. At the same time, use these models for internal red teaming, fuzz triage, and exploit analysis under strict controls. The right response is not denial. It is to build systems and workflows that assume autonomous exploit development is already part of the threat landscape.