RSCH · 6 min read · OraCore Editors

Why AI benchmark wins in cyber should scare defenders

AI cyber benchmarks now show autonomous capability is advancing faster than defenders are planning for.

That is not a lab curiosity. It is a warning that the gap between model demos and real intrusion work is closing fast, and security teams that still treat AI as a side issue are already behind.

AI is now crossing the line from assistance to autonomy

The most important detail in the latest findings is not that frontier models can suggest better code or write cleaner phishing lures. It is that Claude Mythos Preview and GPT-5.5 are completing multi-step cyber tasks on their own, in structured ranges that look a lot like the workflow of a real attacker. The UK AI Security Institute said both models outpaced the doubling trend it had been tracking since late 2024, and that the length of cyber tasks models can complete autonomously has doubled on the order of months, not years.

The AISI’s own test cases make the point sharper. Claude Mythos became the first model to complete both of its ranges, solving a 32-step simulated corporate network attack called “The Last Ones” in 6 of 10 attempts and finishing “Cooling Tower,” which no model had previously solved, in 3 of 10 attempts. GPT-5.5 solved “The Last Ones” in 3 of 10 attempts. Those are not perfect scores, but they are good enough to matter because cyber offense does not require perfection to create damage.
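The claim that offense does not require perfection can be made concrete with a back-of-envelope calculation. The sketch below assumes each attempt is independent, which real intrusion attempts are not (failed runs leave logs and can trip alarms), so it illustrates the compounding effect rather than models any real campaign:

```python
# Illustrative only: treats attempts as independent, which real
# intrusion attempts are not (failures create logs and alerts).
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    return 1 - (1 - p_single) ** attempts

# A 30% per-run solve rate (GPT-5.5 on "The Last Ones") compounds
# quickly when an automated attacker can retry cheaply:
for k in (1, 5, 10):
    print(k, round(p_at_least_one_success(0.3, k), 3))
```

With a 30% per-attempt rate, ten retries push the chance of at least one success above 97% under this simplified model, which is why a 3-of-10 benchmark score is not reassuring.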

The second signal is that independent groups are converging on the same trend

One benchmark can mislead. Two independent research tracks pointing in the same direction are harder to dismiss. Palo Alto Networks reported that it has been testing Claude Mythos, Claude Opus 4.7, and OpenAI’s GPT-5.5-Cyber through launch and trusted-access programs, and said the latest models are “extraordinarily capable at finding vulnerabilities and chaining them into critical exploit paths in near-real-time.” That is a direct statement from a security vendor with skin in the game, not a speculative warning from a commentator.

The company’s own output is telling. Palo Alto released security advisories covering 26 CVEs representing 75 issues, identified through AI model scanning across more than 130 products. It said that is far above its typical monthly volume of fewer than five CVEs. Even allowing for the fact that AI-assisted scanning can overproduce leads, the scale of the jump shows why this matters: AI is no longer just helping defenders triage known bugs. It is helping uncover vulnerability chains fast enough to overwhelm normal review cycles.

Security teams are underestimating the speed of the offense cycle

The most dangerous implication of these results is time compression. Palo Alto’s recommended response includes building security operations that can react in minutes, because AI-powered attacks may soon unfold that quickly. That is not alarmist language. It is a sober recognition that the old model of hours-long detection, escalation, and containment is too slow when an attacker can automate reconnaissance, exploit development, and post-exploitation steps in near real time.

There is a practical reason this is such a big shift. Human attackers are constrained by attention, fatigue, and iteration speed. A frontier model can run through candidate paths, discard dead ends, and keep going without losing momentum. When a system can move from vulnerability discovery to critical exploit path construction in one continuous loop, the defender no longer gets the luxury of separating “research time” from “incident time.” Those phases are merging.

The counter-argument

The strongest pushback is that benchmark performance is not the same as operational threat. The AISI itself said the data covers a relatively small number of models, and that the hardest tasks in the suite have the least human comparison data. It also warned that no single benchmark result should be read as a precise measure of AI capability. That caution is right. Cyber ranges are controlled environments, and real networks are messier, more instrumented, and often harder to exploit than tidy simulations.

There is also a legitimate argument that the current results still fall short of fully autonomous compromise at scale. A model solving a task 3 or 6 times out of 10 is not a fully reliable attacker. In many real-world campaigns, reliability matters because failed attempts create logs, trip alarms, and waste opportunity. If the benchmark is too synthetic, the numbers can flatter the models and scare defenders without proving a corresponding jump in live-world breach rates.

That rebuttal does not hold as a reason to relax. The issue is not whether AI has replaced skilled intruders today. The issue is that the slope is steep enough, and the independent measurements aligned enough, that waiting for perfect proof would be reckless. The AISI said dropping any single model barely changes the estimated doubling time, and METR independently arrived at nearly the same four-month doubling figure for the period since late 2024. When separate groups, different methods, and different models all point to the same acceleration, the responsible conclusion is not skepticism. It is preparation.
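The slope argument can be sanity-checked with simple arithmetic. This sketch assumes the roughly four-month doubling time holds as a clean exponential, which is a strong assumption (trends can bend), and the one-hour baseline is illustrative, not a reported figure:

```python
# Back-of-envelope extrapolation. Assumes the ~4-month doubling time
# reported by AISI/METR continues unchanged -- a big assumption.
def projected_task_length(baseline_hours: float, months_ahead: float,
                          doubling_months: float = 4.0) -> float:
    """Autonomous task length after `months_ahead` under exponential growth."""
    return baseline_hours * 2 ** (months_ahead / doubling_months)

# If models handle ~1 hour of autonomous cyber work today, 12 months
# of the same trend implies ~8 hours -- a full working day of intrusion.
print(projected_task_length(1.0, 12))  # → 8.0
```

The exact baseline matters less than the shape: under any exponential with a months-scale doubling time, the gap between planning cycles and capability growth widens every quarter.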

What to do with this

Engineers, PMs, and founders should treat autonomous cyber capability as a product risk, not a future research topic. Assume attackers will use frontier models to find weak points faster than your normal release cadence, then shorten your own response loops to match. Prioritize dependency hygiene, secret management, patch velocity, and detection coverage over feature work that expands attack surface without clear value. If your team cannot identify, patch, and verify critical exposures in days, you are already operating on borrowed time.
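One way to operationalize that closing test is to track an exposure window per critical issue, from vulnerability publication to verified patch. The structure and CVE identifiers below are hypothetical, purely to show the metric, not drawn from any real feed or API:

```python
# Hypothetical sketch: measure the "exposure window" per critical CVE
# (publication -> patch verified). CVE IDs and dates are invented.
from datetime import date

exposures = [  # (cve_id, published, patch_verified)
    ("CVE-2025-0001", date(2025, 3, 1), date(2025, 3, 4)),
    ("CVE-2025-0002", date(2025, 3, 2), date(2025, 3, 16)),
]

# Days each issue stayed open before the fix was verified in production.
windows = {cve: (fixed - pub).days for cve, pub, fixed in exposures}
worst = max(windows.values())
print(windows, worst)
```

If the worst window routinely stretches into weeks rather than days, the team is operating in exactly the borrowed-time regime described above, and that number is a better sprint-planning input than most feature metrics.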