The costliest AI mistake may be the broken problem a model refuses to question
SOOHAK Shows AI Can Calculate, But Still Struggles to Stop
📷 AI-generated image / TECH&SPACE
- SOOHAK includes 340 valid Challenge tasks and 99 deliberately flawed Refusal tasks.
- Gemini 3 Pro leads the Challenge set at 30%, ahead of GPT-5 at 26% and Claude Opus 4.5 at 10%.
- No model crosses 50% on the Refusal set, where the correct answer is to identify the flaw in the problem.
AI math progress has lately been sold through shiny milestones: Olympiad-level scores, elegant proofs, and the usual implication that research automation is just around the corner. SOOHAK is useful because it asks a less flattering question: can a model tell when the problem itself is broken?
The benchmark was developed by 64 mathematicians and contains 439 handwritten tasks. Of those, 340 sit in a Challenge set, while 99 are deliberately flawed Refusal problems where the right move is not to compute harder, but to identify the defect. According to the reported setup, contributors included professors, PhD students, postdocs, and IMO medalists, with no AI assistance used in writing the tasks.
On the solving side, the numbers are respectable but not magical. Google’s Gemini 3 Pro reportedly leads the Challenge set at 30%, followed by GPT-5 at 26% and Claude Opus 4.5 at 10%. That is progress, but even the leader fails roughly seven in ten valid tasks, which is not the same thing as broad mathematical research ability, unless one’s definition of research politely ignores most failed starts.
The math benchmark tests whether models can reject broken problems instead of dressing up bad premises
The sharper result is in the refusal test. No model reaches 50% accuracy on SOOHAK’s broken problems, and the benchmark only gives credit when a system spots and names the flaw instead of confidently producing an answer. The reported findings also suggest that extra compute helps models solve more valid tasks, but does not reliably improve their ability to recognize invalid ones.
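The scoring rule described above can be made concrete with a small sketch. This is not SOOHAK's actual grader, which the reporting does not detail. The `Task` and `Response` types, the `named_flaw` verdict (assumed to come from a human or automated judge), and the choice to give no credit when a model both flags a flaw and still commits to an answer are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Task:
    is_flawed: bool               # True for a Refusal task, False for a Challenge task
    reference: Optional[str]      # reference answer; None when the task is flawed


@dataclass
class Response:
    named_flaw: bool              # assumed judge verdict: the actual defect was identified
    answer: Optional[str] = None  # final answer, if the model committed to one


def is_correct(task: Task, response: Response) -> bool:
    if task.is_flawed:
        # Assumption: credit only when the defect is named and no answer is asserted.
        return response.named_flaw and response.answer is None
    # Valid task: the model must commit to an answer that matches the reference.
    return response.answer is not None and response.answer == task.reference


def accuracy(pairs: Iterable[Tuple[Task, Response]]) -> float:
    pairs = list(pairs)
    return sum(is_correct(t, r) for t, r in pairs) / len(pairs)


# Three hypothetical responses to flawed tasks.
refusal_run = [
    (Task(True, None), Response(named_flaw=True)),               # names the defect
    (Task(True, None), Response(named_flaw=False, answer="42")), # confident answer
    (Task(True, None), Response(named_flaw=False)),              # declines, flaw not named
]
print(accuracy(refusal_run))
```

Under this rule, only the first response earns credit. A vague refusal scores the same as a confident wrong answer, which is what separates spotting a flaw from simply declining.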
That distinction matters for developers and research teams. A model that pushes through ambiguity can look productive in demos, but in technical workflows it can waste time by laundering a false premise into a polished derivation. In math, code, science, and engineering, “no solution under these assumptions” is not a failure mode. It is often the answer.
SOOHAK also reframes the benchmark race. The competitive advantage may not belong to the model that generates the longest proof fastest, but to the one that can stop, inspect the premises, and decline the task for the right reason. The real signal here is not whether AI can sound like a mathematician on a clean prompt, but whether it can survive the messy part where the prompt is wrong.

