TECH & SPACE
Medicine

OpenAI model beats doctors on diagnosis, with caveats

Boston, Massachusetts, United States
Source: STAT News

Researchers led by internist Adam Rodman published a study in Science showing an OpenAI large language model outperforming physicians in diagnostic evaluations using real-world data from a Boston emergency department. The findings carry weight for AI's role in healthcare, but Rodman has been explicit that outperforming doctors on case-based tests does not prove the model is safe or effective in actual clinical workflows. The study intentionally echoes a 1959 Science paper that established how to evaluate whether clinical decision support systems can surpass human diagnostic performance. What comes next depends on whether the healthcare industry treats this as a research milestone or a product green light.

Image: Clinicians review anonymized emergency cases beside an AI diagnosis panel. (AI-generated / Tech&Space)

By Dr. Elara Voss, Medicine editor. "Never confuses promising data with actual care."
  • A Science study tests diagnostic cases
  • The model beats doctors in a controlled setting
  • Rodman warns against premature clinical hype

Adam Rodman, an internist and clinical AI researcher, has spent years thinking about how machines might augment medical judgment. His latest paper in Science delivers a statistically clean result: an OpenAI large language model, fed real-world data from a Boston emergency department, outperformed physicians on case-based diagnostic and clinical reasoning evaluations. The methodology deliberately mirrors a 1959 Science paper that asked how one could determine if a clinical decision support system could diagnose better than humans.

Rodman is not pretending this is a surprise. "And they can do it," he told STAT, with the weary confidence of someone who expected the result but knows what comes after. The model's success on these evaluations is not in doubt. What remains unresolved is whether that success translates into anything a hospital should act on.

The study's design matters here. It uses real patient data, not the sanitized benchmarks that dominate AI marketing. That choice makes the finding harder to dismiss and harder to celebrate.

The gap between benchmark performance and clinical trust

Image: A clinical pathway places doctor review and safety checks after model output. (AI-generated / Tech&Space)

Rodman's central concern is that the results will be misread as evidence of AI's safety and efficacy in live clinical settings. They are not. A model that scores well on retrospective case evaluations has not faced the chaos of an actual ER: the incomplete histories, the contradictory tests, the patients who lie or don't know their own symptoms. The STAT reporting makes clear that Rodman has been vocal about this distinction, which places him in the awkward position of publishing impressive results while warning against their misuse.

The 1959 echo is telling. That earlier paper asked how to evaluate whether machines could diagnose better than humans; this one answers that they can, under controlled conditions. The question of whether they should be deployed, and under what regulatory and ethical frameworks, remains open. The scientific community's response will shape whether this becomes a genuine inflection point or another cycle of pilot programs and stalled integration.

In other words, the model passed the test. The healthcare system still has to write the rules for what passing means.
