
AI Depression Detectors Cheat by Reading the Interviewer
- ★Models exploit fixed prompts, not symptoms
- ★Three datasets show identical bias pattern
- ★Benchmarks inflate accuracy by design
A new arXiv paper strikes at the heart of one of AI’s most hyped healthcare applications: depression detection from clinical interviews. Researchers analyzed three major datasets—ANDROIDS, DAIC-WOZ, and E-DAIC—and found that models weren’t actually learning to spot depressive symptoms. Instead, they were gaming the system by memorizing the interviewer’s fixed prompts.
This isn’t just a quirk of bad data. It’s a systemic flaw in how these datasets were constructed. Semi-structured interviews rely on standardized questions, and models trained on interviewer turns quickly learn to associate those prompts—not patient responses—with depression labels. The result? Benchmark performance that looks impressive but collapses under real-world conditions, where interviewers don’t follow scripts to the letter.
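One way to see how this kind of leakage gets diagnosed is an interviewer-only ablation: if a classifier that sees only the interviewer's turns still predicts depression labels well above chance, the labels are leaking through the script rather than the patient's speech. The sketch below illustrates the idea; the transcript schema (speaker, text, label fields) and the choice of a TF-IDF plus logistic-regression probe are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of an "interviewer-only" ablation to probe for prompt leakage.
# The session schema below is hypothetical; the point is only the control:
# strip out every patient turn and see whether the labels remain predictable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def interviewer_only_ablation(sessions):
    """sessions: list of dicts like
    {"turns": [{"speaker": "interviewer" | "patient", "text": str}, ...],
     "label": 0 or 1}   # hypothetical schema, not the datasets' real format
    """
    texts, labels = [], []
    for s in sessions:
        # Keep only what the interviewer said; discard all patient speech.
        interviewer_text = " ".join(
            t["text"] for t in s["turns"] if t["speaker"] == "interviewer"
        )
        texts.append(interviewer_text)
        labels.append(s["label"])

    probe = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    # Cross-validated performance well above the majority-class baseline
    # means the fixed prompts themselves carry the label signal.
    scores = cross_val_score(probe, texts, labels, cv=5, scoring="f1_macro")
    return scores.mean()
```

A probe this crude should do no better than chance if the benchmark truly measures symptoms; anything more is a red flag about the data, not evidence of clinical skill.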
The paper’s findings underscore a recurring problem in AI healthcare: benchmarks are often optimized for researchers, not patients. Here, the bias isn’t subtle—it’s baked into the data collection process itself. Yet another reminder that high accuracy numbers mean little when the model is solving for the wrong thing.

The real signal isn’t depression—it’s the interviewer’s script
For developers, this is a cautionary tale about dataset design. The open-source community has already begun debating fixes, but the damage is done. These datasets have been used in dozens of papers, and the bias likely extends to other applications of structured interviews, from PTSD screening to autism diagnostics.
The competitive implication is stark: companies building commercial depression detection tools must now revalidate their models—or risk deploying systems that fail in real clinical settings. Startups like Ellipsis Health, which has raised millions for voice-based mental health screening, could face significant rework if their datasets suffer from the same flaws. Meanwhile, academia’s reliance on these benchmarks means years of research may need reevaluation.
The real bottleneck isn’t model architecture; it’s data integrity. As long as AI healthcare relies on synthetic or overly structured datasets, these biases will persist. The industry’s rush to automate diagnosis needs a hard pause—and a reckoning with how these datasets are built.