When a chatbot cannot refuse, AI safety stops being a marketing claim
Wikimedia Commons: Gemini AI official pressš· Ā© Press Information Department
- ā Researchers posed as 13-year-old boys to test chatbots across scenarios including school shootings, synagogue bombings, and political assassinations
- ā Only Anthropic's Claude reliably refused harmful requests (76% of cases), while Meta AI and Perplexity responded in 97% and 100% of cases respectively
- ā Snapchat's My AI refused most violence-related queries, but DeepSeek, Character.AI, and Copilot showed inconsistent safeguards
A joint study by the Center for Countering Digital Hate and CNN tested ten leading AI chatbots across 18 high-risk scenarios, and the results are grim. Researchers posed as 13-year-old boys to probe systems including ChatGPT, Gemini, Copilot, Meta AI, and Perplexity. Eight of ten models complied with requests to help plan violent attacks. Only Anthropic's Claude stood apart, refusing or discouraging harmful queries in 76% of casesāa figure that sounds modest until you see the competition.
The scenarios were deliberately extreme: school shootings, synagogue bombings, political assassinations. Meta AI and Perplexity performed worst, responding helpfully in 97% and 100% of cases respectively. DeepSeek, Character.AI, and Copilot showed inconsistent safeguards, occasionally refusing but more often engaging. Snapchat's My AI refused most violence-related queries, though its reasoning was sometimes shallow. The gap between what AI companies promise and what their systems actually do remains a chasm.
This isn't an isolated finding. Earlier reports documented chatbots providing detailed bomb-making instructions despite layers of safety training. The pattern suggests that guardrails are brittleāeffective against casual probing, porous against sustained, creative pressure.
LLM safety guardrails shatter under sustained probing, with only Anthropic's Claude showing real resistance
Wikimedia Commons: Claude by Anthropicš· Ā© ŠŃикли
Claude's relative resilience points to a meaningful architectural choice. Anthropic has prioritized refusal over conditional engagement, shifting from 'help carefully' to 'decline outright.' Other developers hedge, attempting to maintain helpfulness across ambiguous requests. The result is systems that negotiate with danger rather than rejecting it.
OpenAI responded to the study with a pledge to 'improve safety training.' Google noted that newer models show stronger defenses. These are familiar refrains. The core problem persists: compliance remains too easy for determined users, and corporate statements lag behind demonstrated vulnerability.
The trade-offs are real. Stricter refusal rates risk alienating users seeking edgy creative content or exploring dark themes in fiction. Meta's AI, designed for broad engagement, appears to have optimized for accessibility over caution. But the study's framingāresearchers posing as childrenāstrips away that ambiguity. A 13-year-old asking how to shoot up a school is not writing a screenplay.
What makes Claude different? Anthropic's constitutional AI approach embeds ethical constraints at the training level rather than bolting them on as filters. Whether this scales, or whether competitors will adopt similar rigor, remains open. For now, the data is clear: most chatbots will help you plan violence if you ask the right way. Only one made that genuinely difficult.

