ChatGPT may hallucinate less, but trust still needs proof
The headline number is useful only if the evaluation context is legible. Image: Generated editorial visual / Tech&Space
- OpenAI claims 52.5% fewer hallucinated claims
- Internal evaluations lack transparency
- High-stakes fields remain skeptical
OpenAI's latest gambit to curb ChatGPT's tendency to invent facts arrives in the form of GPT-5.5 Instant, now the default model for all users. The company claims the update delivers "significant improvements in factuality across the board," with a 52.5% reduction in hallucinated claims, a statistic derived from its own undisclosed internal evaluations. For an AI whose confabulations have led to everything from legal misinformation to medical misdiagnoses, the promise of greater reliability is a welcome one. But in an industry where benchmarks are often as opaque as the models themselves, the absence of third-party scrutiny makes the claim feel more like a press release than a breakthrough.
The timing is no coincidence. As competitors like Anthropic and Google push their own "hallucination-resistant" models, OpenAI is under pressure to prove its technology isn't just fast, but trustworthy. Yet the company's refusal to clarify which model served as the baseline for its 52.5% figure, or how exactly these evaluations were conducted, leaves a gaping hole in its credibility. Are we seeing genuine progress, or just a more polished version of the same old flaws? The answer, as ever, lies in the fine print that OpenAI isn't sharing.
Fewer wrong answers sounds excellent; the question is how transparently the improvement was measured.
For default models, benchmark claims become product claims almost immediately. Image: Generated editorial visual / Tech&Space
The Verge's coverage highlights the stakes: in fields like finance and healthcare, even a 10% error rate can be catastrophic. A 52.5% improvement sounds impressive until you realize it's measured against an unknown standard. For now, the upgrade feels less like a solution and more like a bet that users won't notice the difference.
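To see why the missing baseline matters, here is a minimal sketch of the arithmetic behind a relative reduction. The baseline rates below are hypothetical, chosen only for illustration; OpenAI has not published the real one, and the function name `residual_rate` is our own.

```python
def residual_rate(baseline_rate: float, relative_reduction: float = 0.525) -> float:
    """Return the error rate left after a relative reduction is applied.

    baseline_rate: share of claims hallucinated before the update (0 to 1).
    relative_reduction: the claimed relative cut (0.525 for the 52.5% figure).
    """
    if not 0.0 <= baseline_rate <= 1.0:
        raise ValueError("baseline_rate must be between 0 and 1")
    if not 0.0 <= relative_reduction <= 1.0:
        raise ValueError("relative_reduction must be between 0 and 1")
    return baseline_rate * (1.0 - relative_reduction)


if __name__ == "__main__":
    # Hypothetical baselines: the true starting rate is not public.
    for baseline in (0.02, 0.10, 0.30):
        after = residual_rate(baseline)
        print(f"baseline {baseline:.0%} -> after the claimed cut {after:.2%}")
```

Under these assumptions, the same 52.5% cut leaves under 1% of claims hallucinated if the starting point was 2%, but more than 14% if it was 30%. The headline percentage alone cannot tell a reader which of those worlds applies.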
The source material also shows that OpenAI's history of overpromising and under-delivering on reliability doesn't help its case. The company's previous models, including GPT-4, were hailed as game-changers until researchers and users alike uncovered persistent issues with factual accuracy. GPT-5.5 Instant's rollout follows a familiar pattern: a flashy announcement, a bold statistic, and a vague assurance that things are "better." What's missing is the kind of granular, reproducible testing that would allow independent experts to verify the claims. Without it, the upgrade risks being dismissed as another incremental tweak dressed up as a revolution.
The real-world implications are harder to ignore. In sectors where AI-assisted decision-making is already creeping in, like legal research or medical diagnostics, hallucinations aren't just embarrassing; they're dangerous. OpenAI's framing suggests GPT-5.5 Instant is a step toward addressing these concerns, but the lack of transparency undermines its own argument. If the model is truly more factual, why not let outsiders put it to the test?
For now, the update serves as a reminder of the gap between AI marketing and AI reality. OpenAI's internal benchmarks may show progress, but until those numbers are backed by external validation, they're just numbers. The question isn't whether GPT-5.5 Instant hallucinates less; it's whether anyone outside OpenAI can prove it.
For source context, compare The Verge, NIST AI RMF and OECD AI Principles.

