GPT-5.5 hits Claude Mythos territory in cyber tests, with wider access
AISI's results measure controlled cyber capability, not success against a well-defended network. | Image: AI-generated
- ✅ GPT-5.5 is the second model to complete AISI's 32-step TLO attack simulation
- ✅ It scored 71.4 percent on expert CTF tasks, statistically close to Claude Mythos Preview
- ✅ AISI warns the tests lacked active defenders and real-world alert consequences
The UK's AI Security Institute tested OpenAI's GPT-5.5 on cyber tasks that no longer look like a routine benchmark. On expert-level capture-the-flag tasks, the model reached an average pass rate of 71.4 percent, with an 8.0 percentage-point standard error. Claude Mythos Preview landed at 68.6 percent, which makes the gap interesting but not clean enough for a victory lap.
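The "statistically close" reading can be checked with back-of-the-envelope arithmetic. A minimal sketch, assuming Claude Mythos Preview's score carries a standard error similar to GPT-5.5's reported 8.0 percentage points (AISI reports only the latter, so the symmetric assumption is an illustration, not AISI's method):

```python
import math

def gap_vs_noise(p1, p2, se1, se2):
    """Rough significance check for the gap between two pass rates.

    Treats the two estimates as independent, so the standard error of
    the difference is the root-sum-square of the individual errors.
    Returns the gap, the combined SE, and a z-like ratio.
    """
    diff = p1 - p2
    se_diff = math.sqrt(se1**2 + se2**2)
    return diff, se_diff, abs(diff) / se_diff

# GPT-5.5 (71.4%) vs Claude Mythos Preview (68.6%) on expert CTF tasks.
# The 8.0 pp SE for Mythos is an assumed stand-in, not a reported figure.
diff, se_diff, z = gap_vs_noise(71.4, 68.6, 8.0, 8.0)
print(f"gap = {diff:.1f} pp, combined SE = {se_diff:.1f} pp, z = {z:.2f}")
# gap = 2.8 pp, combined SE = 11.3 pp, z = 0.25
```

With a combined standard error around 11 points, the 2.8-point gap sits deep inside the noise (z far below the conventional 1.96 threshold), which is why the result reads as a tie rather than a win.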
The more important result sits inside The Last Ones (TLO), a 32-step corporate attack simulation spanning four subnets and roughly 20 hosts. AISI estimates that a human expert would need around 20 hours to complete the chain. GPT-5.5 completed it in 2 of 10 attempts, while Claude Mythos Preview had previously reached the same endpoint in 3 of 10.
That is a signal, not a license to panic. AISI explicitly says its current ranges lack active defenders, defensive tooling, and penalties for actions that would trigger alarms in a real network. In plain terms, the model showed it can orchestrate an attack in a vulnerable laboratory environment, but it did not prove it can quietly beat a well-defended organization.
AISI's results do not prove the model can beat hardened networks, but they show offensive skill emerging as a byproduct of broader model progress.
The key caveat: TLO did not include active defense, monitoring, or alarm penalties. | Image: AI-generated
The uncomfortable lesson is not just GPT-5.5's score, but the direction of travel. AISI writes that offensive cyber skill may be emerging as a byproduct of broader progress in autonomy, coding, and reasoning. That means the security problem cannot be reduced to banning one specialized model or one penetration-testing tool.
One example from the report shows why. In the rust_vm task, GPT-5.5 reverse-engineered a custom virtual machine, built a disassembler, and solved the password check without human assistance in just over 10 minutes. AISI says a Crystal Peak Security expert playtester took roughly 12 hours on the same challenge.
Access remains the practical difference between models. The Decoder notes that GPT-5.5 is more broadly available through ChatGPT and the API, while Claude Mythos Preview remains restricted. Still, AISI also notes that public deployments include extra safeguards, monitoring, and access controls, so the raw evaluation is not identical to what any ordinary user can extract from the product.
The sensible conclusion is therefore two-sided. Defenders should assume AI will accelerate vulnerability discovery and exploitation in poorly maintained systems. At the same time, the real test for GPT-5.5 and Mythos is still ahead: networks with active defense, detection, decoys, and consequences for every noisy mistake.

