GPT-5.5 hits Claude Mythos territory in cyber tests, with wider access
AISI's results measure controlled cyber capability, not success against a well-defended network. | Image: AI-generated
- ✅ GPT-5.5 is the second model to complete AISI's 32-step TLO attack simulation
- ✅ It scored 71.4 percent on expert CTF tasks, statistically close to Claude Mythos Preview
- ✅ AISI warns the tests lacked active defenders and real-world alert consequences
The UK's AI Security Institute tested OpenAI's GPT-5.5 on cyber tasks that no longer look like a routine benchmark. On expert-level capture-the-flag tasks, the model reached an average pass rate of 71.4 percent, with an 8.0 percentage-point standard error. Claude Mythos Preview landed at 68.6 percent, which makes the gap interesting but not clean enough for a victory lap.
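The "statistically close" reading can be checked with back-of-the-envelope arithmetic. A minimal sketch, assuming Claude Mythos Preview's score carries a standard error similar to GPT-5.5's reported 8.0 percentage points (AISI reports only the latter, so the symmetric assumption is an illustration, not AISI's method):

```python
import math

def gap_vs_noise(p1, p2, se1, se2):
    """Rough significance check for the gap between two pass rates.

    Treats the two estimates as independent, so the standard error of
    the difference is the root-sum-square of the individual errors.
    Returns the gap, the combined SE, and a z-like ratio.
    """
    diff = p1 - p2
    se_diff = math.sqrt(se1**2 + se2**2)
    return diff, se_diff, abs(diff) / se_diff

# GPT-5.5 (71.4%) vs Claude Mythos Preview (68.6%) on expert CTF tasks.
# The 8.0 pp SE for Mythos is an assumed stand-in, not a reported figure.
diff, se_diff, z = gap_vs_noise(71.4, 68.6, 8.0, 8.0)
print(f"gap = {diff:.1f} pp, combined SE = {se_diff:.1f} pp, z = {z:.2f}")
# gap = 2.8 pp, combined SE = 11.3 pp, z = 0.25
```

With a combined standard error around 11 points, the 2.8-point gap sits deep inside the noise (z far below the conventional 1.96 threshold), which is why the result reads as a tie rather than a win.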
The more important result sits inside The Last Ones (TLO), a 32-step corporate attack simulation spanning four subnets and roughly 20 hosts. AISI estimates that a human expert would need around 20 hours to complete the chain. GPT-5.5 completed it in 2 of 10 attempts, while Claude Mythos Preview had previously reached the same endpoint in 3 of 10.
That is a signal, not a license to panic. AISI explicitly says its current ranges lack active defenders, defensive tooling, and penalties for actions that would trigger alarms in a real network. In plain terms, the model showed it can orchestrate an attack in a vulnerable laboratory environment, but it did not prove it can quietly beat a well-defended organization.
AISI's results do not prove the model can beat hardened networks, but they show offensive skill emerging as a byproduct of broader model progress.
The key caveat: TLO did not include active defense, monitoring, or alarm penalties. | Image: AI-generated
The uncomfortable lesson is not just GPT-5.5's score, but the direction of travel. AISI writes that offensive cyber skill may be emerging as a byproduct of broader progress in autonomy, coding, and reasoning. That means the security problem cannot be reduced to banning one specialized model or one penetration-testing tool.
One example from the report shows why. In the rust_vm task, GPT-5.5 reverse-engineered a custom virtual machine, built a disassembler, and solved the password check without human assistance in just over 10 minutes. AISI says a Crystal Peak Security expert playtester took roughly 12 hours on the same challenge.
Access remains the practical difference between models. The Decoder notes that GPT-5.5 is more broadly available through ChatGPT and the API, while Claude Mythos Preview remains restricted. Still, AISI also notes that public deployments include extra safeguards, monitoring, and access controls, so the raw evaluation is not identical to what any ordinary user can extract from the product.
The sensible conclusion is therefore two-sided. Defenders should assume AI will accelerate vulnerability discovery and exploitation in poorly maintained systems. At the same time, the real test for GPT-5.5 and Mythos is still ahead: networks with active defense, detection, decoys, and consequences for every noisy mistake.

