CiteVQA exposes the AI failure behind GPT and Gemini’s confident document answers
A correct answer is not enough when the evidence points to the wrong place.
- CiteVQA measures whether an AI model can attach an answer to a passage that truly supports it.
- The failure is not only a wrong answer, but a right answer backed by the wrong evidence.
- The risk is especially serious in regulated fields such as law and medicine.
Leading AI models are getting better at extracting answers from documents, but the next reliability problem goes beyond simple accuracy: a correct answer alone is not enough. According to The Decoder, researchers at Peking University warn that models such as GPT and Gemini often cite passages that do not actually support the claim they just made.
That is not the usual hallucination pattern. In the classic version, a model invents a fact or reaches the wrong conclusion. Here, the surface can look clean: the answer is correct, the tone is confident, and a citation is present. The failure is that the cited passage does not carry the evidential weight of the answer. The researchers call this “attribution hallucination.”
For casual search, that is annoying. For systems used to analyze contracts, medical records, regulatory filings, or internal audits, it is much more serious. If the user does not check the passage, they may believe the decision is properly grounded. If they do check it, they may discover that the model reached the right conclusion while pointing to the wrong shelf in the archive.
Peking University researchers’ CiteVQA benchmark targets attribution hallucination in document analysis.
Attribution hallucination breaks the link between claim and source.
That is why CiteVQA matters. The source summary describes it as the first systematic benchmark for this specific failure mode. Its value does not lie in yet another broad score for whether a model is generally “smart.” It asks a narrower and more operational question: can the model show where the document actually supports its answer?
This distinction is especially important for document assistants. The user is not asking only for a paraphrase. They are asking for a trail: a passage, a page, a sentence, a place where the claim can be checked. When that trail detaches from the real evidence, the system starts acting like an audit tool without audit discipline.
The problem also cannot be solved by nicer wording alone. A model that sounds more cautious can still attach an answer to the wrong evidence. A model that provides more citations may simply create more bad anchors. In regulated domains, reliability has to include a verifiable link between answer and source, not just a polished impression of confidence.
CiteVQA should therefore be read as a signal for the next phase of AI evaluation. It is not enough to ask whether the answer is true. We also have to ask whether the truth is correctly tied to the evidence. Without that, document AI remains a useful assistant but a dangerous witness.
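To make the two questions concrete, here is a minimal Python sketch that scores answer correctness and citation support separately. It is an illustration only, not CiteVQA's actual scoring method, which the source summary does not describe. The names `Prediction`, `GoldItem`, and `classify`, the passage IDs, and the exact-match comparison are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical model output: an answer plus the passage it cites."""
    answer: str
    cited_passage_id: str


@dataclass
class GoldItem:
    """Hypothetical reference: the correct answer and the passages that support it."""
    answer: str
    supporting_passage_ids: frozenset


def classify(pred: Prediction, gold: GoldItem) -> str:
    """Label a prediction on two independent axes: answer and evidence."""
    # Assumption: case-insensitive exact match stands in for answer correctness.
    answer_ok = pred.answer.strip().lower() == gold.answer.strip().lower()
    # Assumption: a citation counts as supporting only if it is in the gold set.
    citation_ok = pred.cited_passage_id in gold.supporting_passage_ids

    if answer_ok and citation_ok:
        return "grounded"
    if answer_ok:
        # Right answer, wrong evidence: the failure described in the article.
        return "attribution_hallucination"
    return "wrong_answer"


gold = GoldItem(answer="30 days", supporting_passage_ids=frozenset({"p4"}))
print(classify(Prediction("30 days", "p4"), gold))  # grounded
print(classify(Prediction("30 days", "p9"), gold))  # attribution_hallucination
print(classify(Prediction("60 days", "p4"), gold))  # wrong_answer
```

An accuracy-only metric would count the first two predictions as equally good. Tracking the citation separately is what exposes the second one.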

