

The costliest AI mistake may be the broken problem a model refuses to question

May 17, 2026

San Francisco, CA

Quick article interpreter

A group of 64 mathematicians has built SOOHAK, a benchmark designed to test research-level mathematical reasoning rather than polished contest performance. The results show frontier models can solve some difficult problems, with Gemini 3 Pro leading the Challenge set at 30%, but they remain weak at identifying tasks with no valid solution. That matters because real research often includes bad assumptions, missing constraints, and dead ends, not just neatly packaged puzzles. The next signal to watch is whether model makers improve refusal accuracy instead of merely scaling answer generation.

SOOHAK Shows AI Can Calculate, But Still Struggles to Stop

📷 AI-generated image / TECH&SPACE

Author: Nexus Vale, AI editor. “Can smell synthetic confidence before the first paragraph ends.”

★SOOHAK includes 340 valid Challenge tasks and 99 deliberately flawed Refusal tasks.
★Gemini 3 Pro leads the Challenge set at 30%, ahead of GPT-5 at 26% and Claude Opus 4.5 at 10%.
★No model crosses 50% on the Refusal set, where the correct answer is to identify the flaw in the problem.

AI math progress has lately been sold through shiny milestones: Olympiad-level scores, elegant proofs, and the usual implication that research automation is just around the corner. SOOHAK is useful because it asks a less flattering question: can a model tell when the problem itself is broken?

The benchmark was developed by 64 mathematicians and contains 439 human-written tasks. Of those, 340 sit in a Challenge set, while 99 are deliberately flawed Refusal problems where the right move is not to compute harder, but to identify the defect. According to the reported setup, contributors included professors, PhD students, postdocs, and IMO medalists, and no AI assistance was used in writing the tasks.

On the solving side, the numbers are respectable but not magical. Google’s Gemini 3 Pro reportedly leads the Challenge set at 30%, followed by GPT-5 at 26% and Claude Opus 4.5 at 10%. That is progress, but it is not the same thing as broad mathematical research ability, unless one’s definition of research politely ignores most failed starts.

The math benchmark tests whether models can reject broken problems instead of dressing up bad premises

📷 AI-generated image / TECH&SPACE

The sharper result is in the refusal test. No model reaches 50% accuracy on SOOHAK’s broken problems, and the benchmark only gives credit when a system spots and names the flaw instead of confidently producing an answer. The reported findings also suggest that extra compute helps models solve more valid tasks, but does not reliably improve their ability to recognize invalid ones.
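The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not SOOHAK's actual grader: the `Task` and `Response` types, their field names, and the idea that a separate judge has already filled in `answer_correct` and `flaw_named` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    is_flawed: bool  # True for a Refusal task, False for a Challenge task (assumed field)


@dataclass
class Response:
    gave_answer: bool = False     # the model produced a final answer
    answer_correct: bool = False  # a judge matched it to the reference (Challenge tasks only)
    flaw_named: bool = False      # a judge confirmed the model named the actual defect


def score(task: Task, resp: Response) -> int:
    """Return 1 for credit, 0 otherwise, under the assumed rule."""
    if task.is_flawed:
        # Credit only for naming the flaw instead of producing an answer.
        return int(resp.flaw_named and not resp.gave_answer)
    return int(resp.answer_correct)


def accuracy_by_set(pairs):
    """Compute accuracy separately for the Challenge and Refusal sets."""
    totals = {"challenge": [0, 0], "refusal": [0, 0]}
    for task, resp in pairs:
        key = "refusal" if task.is_flawed else "challenge"
        totals[key][0] += score(task, resp)
        totals[key][1] += 1
    return {k: (c / n if n else 0.0) for k, (c, n) in totals.items()}
```

Under this rule, a confident answer to a flawed task scores zero no matter how polished the derivation is, which is why the two sets can move independently of each other.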

That distinction matters for developers and research teams. A model that pushes through ambiguity can look productive in demos, but in technical workflows it can waste time by laundering a false premise into a polished derivation. In math, code, science, and engineering, “no solution under these assumptions” is not a failure mode. It is often the answer.
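A toy example makes the point concrete. The problem below (the area of a triangle from its side lengths) is a hypothetical illustration and does not come from SOOHAK; the function name and the wording of the refusal messages are assumptions. The sketch checks the premises first and reports "no solution" with a reason, rather than forcing a number out of an impossible input.

```python
import math
from typing import Union


def triangle_area(a: float, b: float, c: float) -> Union[float, str]:
    """Area via Heron's formula, or an explicit refusal if the premises are broken."""
    sides = sorted([a, b, c])
    if sides[0] <= 0:
        return "no solution under these assumptions: side lengths must be positive"
    if sides[0] + sides[1] <= sides[2]:
        return "no solution under these assumptions: the sides violate the triangle inequality"
    s = (a + b + c) / 2  # semi-perimeter
    return math.sqrt(s * (s - a) * (s - b) * (s - c))


print(triangle_area(3, 4, 5))  # a valid triangle
print(triangle_area(1, 2, 5))  # impossible: 1 + 2 < 5
```

Without the premise checks, the second call would ask for the square root of a negative number. A solver that patched that over (for example by taking an absolute value) would return a plausible-looking area for a triangle that cannot exist.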

SOOHAK also reframes the benchmark race. The competitive advantage may not belong to the model that generates the longest proof fastest, but to the one that can stop, inspect the premises, and decline the task for the right reason. The real signal here is not whether AI can sound like a mathematician on a clean prompt, but whether it can survive the messy part where the prompt is wrong.

TECH&SPACE editorial infographic: compact benchmark diagram contrasting Challenge solving with Refusal recognition, using short English labels and the numbers 340, 99, 30%, 26%, 10%, <50%.

📷 AI-generated image / TECH&SPACE

GPT-5 Claude Gemini Google AI Benchmarking


The Decoder
