ChatGPT, a generative pre-trained transformer chatbot developed by OpenAI, has recently been making waves with its impressive capabilities. From answering questions to generating content, its potential applications in healthcare and education have become a topic of exploration and debate. However, amid the excitement, concerns have emerged about the accuracy of the scientific references it provides.
In fact, certain journals, like Science, have gone so far as to ban chatbot-generated text in their published reports. This has led to an investigation aimed at quantifying ChatGPT’s citation error rate.
Investigating ChatGPT’s Citation Accuracy
A team of researchers from the Learning Health Community in Palo Alto, California, conducted a study to gauge the utility of ChatGPT as a research copilot, focusing on its ability to generate content for learning health systems (LHS).
Engaging with OpenAI's latest GPT-4 model between April 20 and May 6, 2023, the researchers explored a wide spectrum of LHS topics, ranging from broad subjects, such as LHS and data, to specific themes, such as building a stroke risk prediction model using the XGBoost library (see the illustrative sketch below).
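To make the kind of task concrete, the following is a minimal sketch of a stroke risk prediction model built with XGBoost. It is purely illustrative: the predictors, synthetic data, and parameters are assumptions for demonstration and are not taken from the study or from ChatGPT's output.

```python
# Illustrative only: a minimal stroke-risk classifier of the kind described above.
# Features, synthetic data, and hyperparameters are hypothetical.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
# Hypothetical predictors: age, systolic blood pressure, atrial fibrillation, diabetes
X = np.column_stack([
    rng.normal(65, 12, n),    # age (years)
    rng.normal(135, 18, n),   # systolic BP (mmHg)
    rng.integers(0, 2, n),    # atrial fibrillation (0/1)
    rng.integers(0, 2, n),    # diabetes (0/1)
])
# Synthetic outcome loosely tied to the predictors
logit = 0.04 * (X[:, 0] - 65) + 0.02 * (X[:, 1] - 135) + 0.8 * X[:, 2] + 0.5 * X[:, 3] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```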
Importantly, the study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
Fact-Checking Cited Journal Articles
To scrutinise ChatGPT’s accuracy in citing journal articles, the researchers employed a meticulous approach. Each cited journal article was subjected to a thorough verification process. This involved confirming the article’s existence in the cited journal and cross-referencing its title on Google Scholar.
Details including the article's title, authors, publication year, volume, issue, and page numbers were compared against the published record. Any article that failed this verification was flagged as fake. A simple sketch of this matching logic is shown below.
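The sketch below captures the verification rule described above in code form. The field names and matching rules are assumptions for illustration; in the study, the checks were performed manually against the cited journal and Google Scholar.

```python
# Illustrative sketch of the reference-verification logic described above.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Reference:
    title: str
    authors: str
    year: int
    volume: str
    issue: str
    pages: str

def normalise(value) -> str:
    """Case- and whitespace-insensitive comparison of bibliographic fields."""
    return str(value).strip().lower()

def is_fake(cited: Reference, published: Optional[Reference]) -> bool:
    """Flag a citation as fake if no matching article exists in the cited
    journal, or if any bibliographic detail disagrees with the published record."""
    if published is None:  # article could not be found at all
        return True
    return any(
        normalise(c) != normalise(p)
        for c, p in zip(asdict(cited).values(), asdict(published).values())
    )
```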
To establish a dependable error rate, the researchers examined more than 300 article references related to LHS topics. They also conducted interactions with OpenAI's default GPT-3.5 model using the identical LHS topics. Error rates were reported with exact 95% confidence intervals, and statistical significance was assessed using the Fisher exact test at a threshold of P < .05.
Startling Findings Unveiled
The results were striking. Of the 162 reference journal articles from the default GPT-3.5 model that were fact-checked, 98.1% (95% CI, 94.7%-99.6%) were identified as fake. For the GPT-4 model, 20.6% (95% CI, 15.8%-26.1%) of the 257 fact-checked articles were fake. While the citation error rate of GPT-4 was significantly lower than that of GPT-3.5 (P < .001), it remained substantial, particularly in narrower subject areas.
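The reported analysis can be reproduced in outline with standard tools, as sketched below. Note that the fake-reference counts (159 of 162 for GPT-3.5 and 53 of 257 for GPT-4) are inferred from the published percentages rather than stated explicitly in the source, so they should be treated as assumptions.

```python
# Sketch of the reported analysis: exact (Clopper-Pearson) 95% CIs and a
# Fisher exact test comparing the two error rates. Counts are inferred from
# the reported percentages, not quoted from the study.
from scipy.stats import binomtest, fisher_exact

gpt35_fake, gpt35_total = 159, 162   # ~98.1% fake (inferred count)
gpt4_fake, gpt4_total = 53, 257      # ~20.6% fake (inferred count)

for label, fake, total in [("GPT-3.5", gpt35_fake, gpt35_total),
                           ("GPT-4", gpt4_fake, gpt4_total)]:
    ci = binomtest(fake, total).proportion_ci(confidence_level=0.95)  # exact CI
    print(f"{label}: {fake / total:.1%} fake, 95% CI {ci.low:.1%}-{ci.high:.1%}")

# 2x2 table of fake vs. genuine references for the two models
table = [[gpt35_fake, gpt35_total - gpt35_fake],
         [gpt4_fake, gpt4_total - gpt4_fake]]
_, p = fisher_exact(table)
print(f"Fisher exact test: P = {p:.2e}")
```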
Implications and Discussion
The implications of these findings are significant. They suggest that GPT-4, while showing promise as a research copilot for LHS education and training, has limitations: it falls short in delivering the latest information and sometimes cites fake journal articles. Consequently, human verification remains essential, and references cited by ChatGPT, especially by GPT-3.5, should be treated with caution.
Moreover, ChatGPT itself has offered insights into why it may return fake references, citing factors like unreliable training data and the model’s challenges in distinguishing reliable from unreliable sources. As generative chatbots become integrated into healthcare education and training, it becomes vital to understand their capabilities, including their inability to fact-check responses. Additionally, ethical concerns, such as misinformation and data bias, should be carefully considered when deploying GPT technology in healthcare settings.