OpenAI model beats doctors on diagnosis, with caveats
Clinicians review anonymized emergency cases beside an AI diagnosis panel. (AI-generated / Tech&Space)
- A Science study tests diagnostic cases
- The model beats doctors in a controlled setting
- Rodman warns against premature clinical hype
Adam Rodman, an internist and clinical AI researcher, has spent years thinking about how machines might augment medical judgment. His latest paper in Science delivers a statistically clean result: an OpenAI large language model, fed real-world data from a Boston emergency department, outperformed physicians on case-based diagnostic and clinical reasoning evaluations. The methodology deliberately mirrors a 1959 Science paper that asked how one could determine if a clinical decision support system could diagnose better than humans.
Rodman is not pretending this is a surprise. "And they can do it," he told STAT, with the weary confidence of someone who expected the result but knows what comes after. The model's success on these evaluations is not in doubt. What remains unresolved is whether that success translates into anything a hospital should act upon.
The study's design matters here. It uses real patient data, not the sanitized benchmarks that dominate AI marketing. That choice makes the finding harder to dismiss and harder to celebrate.
The gap between benchmark performance and clinical trust
A clinical pathway places doctor review and safety checks after model output. (AI-generated / Tech&Space)
Rodman's central concern is that the results will be misread as evidence of AI's safety and efficacy in live clinical settings. They are not. A model that scores well on retrospective case evaluations has not faced the chaos of an actual ER: the incomplete histories, the contradictory tests, the patients who lie or don't know their own symptoms. The STAT reporting makes clear that Rodman has been vocal about this distinction, which places him in the awkward position of publishing impressive results while warning against their misuse.
The 1959 echo is telling. That earlier paper asked how to evaluate whether machines could diagnose better than humans; this one answers that they can, under controlled conditions. The question of whether they should be deployed, and under what regulatory and ethical frameworks, remains open. The scientific community's response will shape whether this becomes a genuine inflection point or another cycle of pilot programs and stalled integration.
In other words, the model passed the test. The healthcare system still has to write the rules for what passing means.