A troubling trend has emerged in the rapidly evolving landscape of artificial intelligence: AI chatbots and research tools are regularly citing material from retracted scientific papers without disclosing their problematic status. Recent studies have revealed that systems like ChatGPT and specialized research AI tools are failing to recognize when papers have been withdrawn from the scientific record, potentially spreading discredited research to millions of users. This discovery raises fundamental questions about the reliability of AI systems that are increasingly being used for medical advice, scientific research, and educational purposes across the globe.
Recent research has exposed a significant flaw in how AI systems evaluate scientific literature. When researchers tested OpenAI’s ChatGPT using questions based on 21 retracted papers about medical imaging, they found that the AI referenced these problematic papers in five of its answers, and advised caution in only three of them. More concerning still, a separate study examining 217 retracted and low-quality papers across multiple scientific fields found that none of the AI chatbot’s responses mentioned retractions or quality concerns.
The issue extends far beyond ChatGPT. When MIT Technology Review tested specialized AI research tools including Elicit, Ai2 ScholarQA, Perplexity, and Consensus using the same retracted papers, the results were alarming. Elicit referenced five retracted papers, while Ai2 ScholarQA cited 17, Perplexity 11, and Consensus 18, all without noting their retraction status.
This problem is particularly concerning given the growing reliance on AI for scientific and medical information. Weikuan Gu, a medical researcher at the University of Tennessee in Memphis and author of one of the studies, explains the severity: “The chatbot is using a real paper, real material, to tell you something. But if people only look at the content of the answer and do not click through to the paper and see that it’s been retracted, that’s really a problem.”
The implications of this issue extend far beyond academic curiosity. The public is increasingly turning to AI chatbots for medical advice and health condition diagnoses. Students and scientists are using AI-focused tools to review existing scientific literature and summarize research papers. The US National Science Foundation has invested $75 million in building AI models specifically for science research, highlighting the technology’s growing role in academic and medical decision-making.
Yuanxi Fu, an information science researcher at the University of Illinois Urbana-Champaign, emphasizes the importance of quality control: “If a tool is facing the general public, then using retraction as a kind of quality indicator is very important. There’s kind of an agreement that retracted papers have been struck off the record of science, and the people who are outside of science should be warned that these are retracted papers.”
The stakes become even higher when considering that retracted papers often contain fundamental flaws in methodology, data, or conclusions. These papers have been formally withdrawn from the scientific record because they were found to be unreliable, fraudulent, or dangerous. When AI systems present this information as credible without appropriate warnings, they risk perpetuating harmful misinformation.
Recognizing the severity of this issue, some companies have started implementing corrective measures. Christian Salem, cofounder of Consensus, acknowledged the problem: “Until recently, we didn’t have great retraction data in our search engine.” His company has now begun using retraction data from multiple sources, including publishers, data aggregators, independent web crawling, and Retraction Watch, which maintains a database of scientific retractions.
When tested again in August, Consensus showed improvement, citing only five retracted papers compared to 18 in earlier tests. Similarly, Elicit announced that it removes retracted papers flagged by the scholarly research catalogue OpenAlex and is working on aggregating additional sources of retractions.
However, the responses have been inconsistent across the industry. Ai2 told researchers that its tool currently does not automatically detect or remove retracted papers. Perplexity’s response was dismissive, stating that it “does not ever claim to be 100% accurate.” The varied approaches highlight the lack of industry standards for handling retracted scientific literature.
The reliance on retraction databases faces significant limitations. Ivan Oransky, cofounder of Retraction Watch, cautions against viewing any database as comprehensive: “The reason it’s resource intensive is because someone has to do it all by hand if you want it to be accurate.” Creating a truly comprehensive retraction database would require more resources than any single organization possesses.
Publishers compound this problem by using inconsistent approaches to retraction notices. Caitlin Bakker from the University of Regina, an expert in research and discovery tools, explains: “Where things are retracted, they can be marked as such in very different ways. Correction, expression of concern, erratum, and retracted are among some labels publishers may add to research papers, and these labels can be added for many reasons.”
Additional complications arise from the distributed nature of academic publishing. Researchers often share papers on preprint servers, repositories, and personal websites, creating multiple copies scattered across the internet. The AI models’ training data may not reflect real-time updates to retraction status, particularly if a paper is retracted after the model’s training cutoff date.
The problem is deeply rooted in how AI models are trained and designed. Large language models are trained on massive datasets of text scraped from the internet. This training data represents a snapshot of online content at a particular time, without ongoing updates to reflect retractions or quality assessments.
Aaron Tay, a librarian at Singapore Management University, notes that “most academic search engines don’t do a real-time check against retraction data, so you are at the mercy of how accurate their corpus is.” This temporal disconnect means that even well-intentioned AI systems may be working with outdated or incomplete information about paper quality.
Furthermore, the current generation of AI models lacks the contextual understanding necessary to evaluate scientific credibility. While they can process and synthesize information effectively, they cannot independently assess the methodological rigor or reliability of research findings.
Experts suggest several approaches to address this crisis. Ivan Oransky and other researchers advocate for making more contextual information available to AI models during training and operation. This could include peer reviews, editorial comments, and critiques from platforms like PubPeer alongside published papers.
Many publishers already make retraction notices freely available, publishing them as separate articles linked to the original paper and placing them outside paywalls. Nature and the BMJ, for example, follow this practice. Fu suggests that AI companies need to make effective use of such information, as well as of news articles in their training data that report paper retractions.
Companies could also implement real-time verification systems that check retraction databases before presenting scientific information to users. While computationally expensive, such systems could significantly improve the reliability of AI-generated responses about scientific topics.
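To make this concrete, here is a minimal sketch of what such a pre-response check might look like, assuming a lookup against OpenAlex (the catalogue Elicit says it uses) and its `is_retracted` flag. The endpoint, field name, and DOIs below are illustrative assumptions, not a description of how any of the companies in this article actually implement their filters.

```python
# Minimal sketch of a pre-response retraction check (assumed design, not any
# vendor's actual pipeline). It queries the OpenAlex works endpoint by DOI
# and drops papers whose record carries the `is_retracted` flag.
import requests

OPENALEX_WORKS = "https://api.openalex.org/works/https://doi.org/{doi}"

def is_retracted(doi: str) -> bool | None:
    """Return True/False if OpenAlex knows the paper, None if the lookup fails."""
    try:
        resp = requests.get(OPENALEX_WORKS.format(doi=doi), timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None  # treat lookup failures as "unknown", not as "safe"
    return bool(resp.json().get("is_retracted", False))

def filter_citations(dois: list[str]) -> list[str]:
    """Keep only papers not known to be retracted before citing them to a user."""
    return [doi for doi in dois if is_retracted(doi) is not True]

if __name__ == "__main__":
    # Hypothetical DOIs for illustration only.
    print(filter_citations(["10.1000/example.001", "10.1000/example.002"]))
```

A production system would cache these lookups and combine several sources (publisher notices, Retraction Watch, aggregators), since no single database is complete, as Oransky's caution above makes clear.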
The discovery that AI systems regularly cite retracted scientific papers represents a fundamental challenge to the trustworthiness of artificial intelligence in scientific and medical domains. As these systems become more integrated into research workflows and public information seeking, the need for robust quality control mechanisms becomes critical. While some companies are beginning to address the issue, the inconsistent responses and underlying systemic challenges suggest that this problem will require coordinated effort from AI developers, publishers, and regulatory bodies.
The ultimate question facing the AI industry is whether these systems can evolve quickly enough to match the trust that users are placing in them. As Aaron Tay wisely notes: “We are at the very, very early stages, and essentially you have to be skeptical.” Until comprehensive solutions are implemented, users of AI systems must remain vigilant about the potential for encountering discredited research, particularly when making important decisions about health or scientific matters.