When AI sees a blank: models can invent medical findings without an image
- Stanford's Phantom-0 set of 200 imageless questions shows models retain 70-80% of standard benchmark scores, inventing anatomical details and clinical narratives.
- The "mirage in vision" phenomenon isn't just an academic curiosity: in medical and safety applications, fabricated diagnoses can have serious consequences.
- Existing evaluation frameworks lag behind model sophistication, failing to detect a fundamental vulnerability in input credibility assessment.
Leading multimodal AI models are now diagnosing images that were never shown to them—and doing it with unsettling confidence. Stanford research reveals that GPT-5, Google's Gemini 3 Pro, and Anthropic's Claude Opus 4.5 generate detailed medical interpretations and visual descriptions even when fed zero pixels. The kicker? Existing benchmarks designed to catch such failures completely miss the behavior, letting these systems pass evaluations while fabricating nonexistent content.
The Stanford team constructed Phantom-0, a set of 200 imageless prompts about medical scans—X-rays, MRIs, dermatology cases. Models retained 70-80% of their standard benchmark scores despite having no visual input whatsoever. They invented anatomical details, constructed clinical narratives, and deployed convincing medical terminology. One might expect stammering uncertainty or explicit refusals. Instead, the AI produced elaborate, plausible-sounding reports complete with specific observations about structures it never observed.
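To make the "retained 70-80%" figure concrete, here is a minimal sketch of how such a retention ratio could be computed. It assumes retention means the imageless score divided by the standard score; the article does not spell out the study's exact formula, and the function name `score_retention` and the numbers below are illustrative only, not results from the Stanford work.

```python
def score_retention(standard_score: float, imageless_score: float) -> float:
    """Fraction of the standard benchmark score kept when no image is supplied.

    Assumed definition: imageless_score / standard_score.
    """
    if standard_score <= 0:
        raise ValueError("standard_score must be positive")
    return imageless_score / standard_score


# Illustrative numbers only: a model scoring 80 points with images
# and 60 points on the same questions with the images removed.
retention = score_retention(standard_score=80.0, imageless_score=60.0)
print(f"retention: {retention:.0%}")
```

Under this reading, a model that truly depended on the image would show retention near the chance level for the benchmark, so a high ratio is the warning sign.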
This "mirage in vision" phenomenon exposes a critical gap between model sophistication and evaluation rigor. The benchmarks that certify these systems for real-world deployment were built to test whether models understand what they do see, not whether they can recognize when they see nothing. It's a distinction with consequences: in medical workflows, a fabricated diagnosis from corrupted or missing image data isn't a curiosity—it's a liability.
Stanford research reveals leading multimodal models confidently fabricate medical diagnoses with zero visual input.
The root cause likely traces to training data saturated with image-text pairs, where the model learns to generate coherent visual descriptions without firm anchoring to actual pixel input. Under uncertainty, the system defaults to plausible inference rather than honest admission of ignorance. This isn't overconfidence in the human sense; it's a structural feature of how current multimodal architectures process ambiguous inputs.
For developers, the implication is stark: traditional validation pipelines need fundamental redesign. Input credibility assessment—verifying that visual data actually arrived and was processed—must become as standard as output quality checks. Healthcare integrators face the more immediate problem: how to build human-in-the-loop safeguards that catch fabricated interpretations without negating the efficiency gains that attracted them to AI in the first place.
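One way an input credibility check might look in practice is a pre-flight guard that refuses to call the model unless image data actually arrived. The sketch below is an assumption-laden illustration, not a method from the Stanford study: `check_image_input`, `guarded_interpret`, the `min_bytes` threshold, and the `model_call` callback are all hypothetical names, and the check only recognizes PNG and JPEG file signatures (real medical pipelines often use DICOM, which would need its own check).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Well-known file signatures ("magic numbers") for two common image formats.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


@dataclass
class InputCheck:
    ok: bool
    reason: str


def check_image_input(image_bytes: Optional[bytes], min_bytes: int = 64) -> InputCheck:
    """Hypothetical pre-flight check: did usable image data actually arrive?"""
    if image_bytes is None:
        return InputCheck(False, "no image attached")
    if len(image_bytes) < min_bytes:
        # min_bytes is an arbitrary illustrative threshold.
        return InputCheck(False, "image payload too small to be a real scan")
    if not image_bytes.startswith((PNG_SIGNATURE, JPEG_SIGNATURE)):
        return InputCheck(False, "unrecognized image format")
    return InputCheck(True, "image present")


def guarded_interpret(
    image_bytes: Optional[bytes],
    question: str,
    model_call: Callable[[bytes, str], str],
) -> str:
    """Forward the request to the model only when the input check passes."""
    check = check_image_input(image_bytes)
    if not check.ok:
        # Refuse up front rather than let the model improvise a finding.
        return f"REFUSED: {check.reason}"
    return model_call(image_bytes, question)
```

A guard like this only confirms that bytes resembling an image were supplied. It cannot tell whether the model actually used them, so it complements human review and imageless probes rather than replacing them.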
The broader pattern matters beyond medicine. Any domain where multimodal models interpret sensor data, surveillance feeds, or diagnostic imagery carries similar risk. Evaluation frameworks that lag behind model capabilities don't merely underperform—they create dangerous certification gaps, blessing systems with safety credentials they haven't earned. Phantom-0 won't be the last probe of this vulnerability, but it should be the wake-up call.

