Comparing five generative AI chatbots’ answers to LLM-generated clinical questions with medical information scientists’ evidence summaries
DOI:
https://doi.org/10.5195/jmla.2026.2333

Keywords:
Large Language Models, LLMs, Generative AI, Chatbots, Artificial Intelligence, Evidence Synthesis, Library Science, Information Science, Biomedical Informatics

Abstract
Objective: To compare answers to clinical questions between five publicly available large language model (LLM) chatbots and information scientists.
Methods: LLMs were prompted to generate 45 PICO (patient, intervention, comparison, outcome) questions addressing treatment, prognosis, and etiology. Each question was answered by a medical information scientist and submitted to five LLM tools: ChatGPT, Gemini, Copilot, DeepSeek, and Grok-3. Using key elements from the answers, pairs of information scientists labeled each LLM answer as in Total Alignment, Partial Alignment, or No Alignment with the information scientist's answer. Partial Alignment answers were further analyzed for the inclusion of additional information.
Results: Of the 225 LLM answers, 20.9% (n=47) were assessed as in Total Alignment, 78.7% (n=177) as in Partial Alignment, and 0.4% (n=1) as in No Alignment. Kruskal-Wallis testing found no significant difference in alignment ratings among the five chatbots (p=0.46). Among the partially aligned answers, Wilcoxon rank-sum testing found a significant difference in the number of additional elements provided by the information scientists versus the chatbots (p=0.02).
Discussion: The five chatbots did not differ significantly in their alignment with information scientists' evidence summaries. Analysis of the partially aligned answers found that both chatbots and information scientists included additional information, with information scientists doing so significantly more often. An important next step will be to assess this additional information, from both the chatbots and the information scientists, for validity and relevance.
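The two statistical comparisons reported above can be illustrated with a short sketch. This is not the authors' code, and all data below are invented for illustration: alignment ratings are coded as an ordinal scale (0 = No Alignment, 1 = Partial Alignment, 2 = Total Alignment), with 45 hypothetical ratings per chatbot.

```python
# Illustrative sketch only; ratings and element counts are hypothetical.
from scipy.stats import kruskal, ranksums

# Hypothetical ordinal alignment ratings, 45 questions per chatbot
# (0 = No Alignment, 1 = Partial Alignment, 2 = Total Alignment).
ratings = {
    "ChatGPT":  [2, 1, 1, 2, 1] * 9,
    "Gemini":   [1, 1, 2, 1, 1] * 9,
    "Copilot":  [1, 2, 1, 1, 1] * 9,
    "DeepSeek": [1, 1, 1, 2, 1] * 9,
    "Grok-3":   [2, 1, 1, 1, 0] * 9,
}

# Kruskal-Wallis test: do the five chatbots differ in alignment ratings?
h_stat, p_kw = kruskal(*ratings.values())
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_kw:.3f}")

# Wilcoxon rank-sum test: counts of additional elements per partially
# aligned answer, information scientists vs. chatbots (hypothetical).
scientist_extras = [3, 2, 4, 2, 3, 1, 2, 3]
chatbot_extras   = [1, 2, 1, 0, 2, 1, 1, 2]
z_stat, p_ws = ranksums(scientist_extras, chatbot_extras)
print(f"Wilcoxon rank-sum z={z_stat:.2f}, p={p_ws:.3f}")
```

Both tests are nonparametric, which suits ordinal alignment labels and small counts where normality cannot be assumed.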
References
1. Presiado M, Montero A, Lopes L, Hamel L. KFF health misinformation tracking poll: artificial intelligence and health information [Internet]. Kaiser Family Foundation; 15 Aug 2024 [cited 23 Sept 2025]. https://www.kff.org/health-misinformation-and-trust/poll-finding/kff-health-misinformation-tracking-poll-artificial-intelligence-and-health-information/.
2. Chapekis A, Lieb A, Shah S, Smith A. What web browsing data tells us about how AI appears online [Internet]. Pew Research Center; 23 May 2025 [cited 23 Sept 2025]. https://www.pewresearch.org/data-labs/2025/05/23/what-web-browsing-data-tells-us-about-how-ai-appears-online/.
3. Taylor J, Dagan K, Youngberg M, Kaufman T, Radding J. A survey of AI tools in library tech: accelerating into and unlocking streamlined enhanced convenient empowering game-changers. J Electron Resour Librariansh. 2025 May;1–14. DOI: https://doi.org/10.1080/1941126X.2025.2497738.
4. Livingston L, Featherstone-Uwague A, Barry A, Barretto K, Morey T, Herrmannova D, Avula V. Reproducible generative artificial intelligence evaluation for health care: a clinician-in-the-loop approach. JAMIA Open. 2025 June;8(3):ooaf054. DOI: https://doi.org/10.1093/jamiaopen/ooaf054.
5. Yau JY, Saadat S, Hsu E, Murphy LS, Roh JS, Suchard J, Tapia A, Wiechmann W, Langdorf MI. Accuracy of prospective assessments of 4 large language model chatbot responses to patient questions about emergency care: experimental comparative study. J Med Internet Res. 2024 Nov 4;26:e60291. DOI: https://doi.org/10.2196/60291.
6. Sundar KR. When patients arrive with answers. JAMA. 2025 Aug;334(8):672-3. DOI: https://doi.org/10.1001/jama.2025.10678.
7. Ashraf AR, Mackey TK, Fittler A. Search engines and generative artificial intelligence integration: public health risks and recommendations to safeguard consumers online. JMIR Public Health Surveill. 2024 Mar;10:e53086. DOI: https://doi.org/10.2196/53086.
8. Eichenberger A, Thielke S, Van Buskirk A. A case of bromism influenced by use of artificial intelligence. Ann Intern Med Clin Cases. 2025 Aug;4(8):e241260. DOI: https://doi.org/10.7326/aimcc.2024.1260.
9. Wang X, Cohen RA. Health information technology use among adults: United States, July-December 2022 [Internet]. NCHS Data Brief No. 482. Hyattsville, MD: National Center for Health Statistics; 2023 [cited 23 Sept 2025]. https://www.cdc.gov/nchs/products/databriefs/db482.htm.
10. DeSalvo K. Google’s impact on health [Internet]. Mountain View, CA: Google; Feb 2025 [cited 17 Oct 2025]. https://services.google.com/fh/files/misc/googles_health_impact.pdf.
11. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and evaluation of health care applications of large language models: a systematic review. JAMA. 2025 Jan 28;333(4):319–28. DOI: https://doi.org/10.1001/jama.2024.21700.
12. Adam GP, DeYoung J, Paul A, Saldanha IJ, Balk EM, Trikalinos TA, Wallace BC. Literature search sandbox: a large language model that generates search queries for systematic reviews. JAMIA Open. 2024;7(3):ooae098. DOI: https://doi.org/10.1093/jamiaopen/ooae098.
13. Bourgeois JP, Ellingson H. Ability of ChatGPT to generate systematic review search strategies compared to a published search strategy. Med Ref Serv Q. 2025 Jul-Sep;44(3):279-291. DOI: https://doi.org/10.1080/02763869.2025.2537075.
14. Wang S, Scells H, Koopman B, Zuccon G. Can ChatGPT write a good Boolean query for systematic review literature search? In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval [Internet]. Taipei, Taiwan: ACM; 2023 [cited 23 Sept 2025]. p. 1426–36. https://dl.acm.org/doi/10.1145/3539618.3591703.
15. Akinseloyin O, Jiang X, Palade V. A question-answering framework for automated abstract screening using large language models. J Am Med Inform Assoc. 2024 Sept 1;31(9):1939-1952. DOI: https://doi.org/10.1093/jamia/ocae166.
16. Lieberum JL, Toews M, Metzendorf MI, Heilmeyer F, Siemens W, Haverkamp C, Böhringer D, Meerpohl JJ, Eisele-Metzger A. Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review. J Clin Epidemiol. 2025 May;181:111746. DOI: https://doi.org/10.1016/j.jclinepi.2025.111746.
17. Department of Biomedical Informatics. Generative AI at VUMC [Internet]. Vanderbilt University Medical Center; [cited 23 Sept 2025]. https://www.vumc.org/dbmi/GenerativeAI.
18. Blasingame MN, Koonce TY, Williams AM, Giuse DA, Su J, Krump PA, Giuse NB. Evaluating a large language model’s ability to answer clinicians’ requests for evidence summaries. J Med Libr Assoc. 2025 Jan 14;113(1):65–77. DOI: https://doi.org/10.5195/jmla.2025.1985.
19. Koonce TY, Williams AM, Giuse DA, Su J, Blasingame MN, Krump PA, Giuse NB. A multi-model evaluation: harnessing generative AI to understand the state-of-the-art of literature search automation. Medical Library Association Annual Meeting, Pittsburgh, PA; Apr 2025.
20. The CHART Collaborative; Huo B, Collins GS, Chartash D, Thirunavukarasu AJ, Flanagin A, et al. Reporting guideline for chatbot health advice studies: the CHART statement. JAMA Netw Open. 2025 Aug 1;8(8):e2530220. DOI: https://doi.org/10.1001/jamanetworkopen.2025.30220.
21. CHART Collaborative. Reporting guidelines for chatbot health advice studies: explanation and elaboration for the Chatbot Assessment Reporting Tool (CHART). BMJ. 2025 Aug 1;390:e083305. DOI: https://doi.org/10.1136/bmj-2024-083305.
22. Moulaei K, Yadegari A, Baharestani M, Farzanbakhsh S, Sabet B, Reza Afrash M. Generative artificial intelligence in healthcare: a scoping review on benefits, challenges and applications. Int J Med Inf. 2024 Aug;188:105474. DOI: https://doi.org/10.1016/j.ijmedinf.2024.105474.
23. OpenAI. ChatGPT free tier FAQ [Internet]. OpenAI Help Center; [cited 1 Aug 2025]. https://help.openai.com/en/articles/9275245-chatgpt-free-tier-faq.
24. Kavukcuoglu K. Gemini 2.0 is now available to everyone [Internet]. Google; 5 Feb 2025 [cited 23 Sept 2025]. https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/.
25. DeepSeek-AI, Guo D, Yang D, Zhang H, Song J, Zhang R, et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 [Preprint]. arXiv; 2025 [cited 23 Sept 2025]. Available from: http://arxiv.org/abs/2501.12948.
26. xAI. Grok 3 Beta — the age of reasoning agents [Internet]. xAI; 19 Feb 2025 [cited 23 Sept 2025]. https://x.ai/news/grok-3.
27. Spataro J. Available today: GPT-5 in Microsoft 365 Copilot [Internet]. Microsoft; 7 Aug 2025 [cited 17 Oct 2025]. https://www.microsoft.com/en-us/microsoft-365/blog/2025/08/07/available-today-gpt-5-in-microsoft-365-copilot/.
28. GovTech Data Science & AI Division. Prompt engineering playbook (Beta v3) [Internet]. Government of Singapore; 30 Aug 2023 [cited 23 Sept 2025]. https://www.developer.tech.gov.sg/products/collections/data-science-and-artificial-intelligence/playbooks/prompt-engineering-playbook-beta-v3.pdf.
29. Schardt C, Adams MB, Owens T, Keitz S, Fontelo P. Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Med Inform Decis Mak. 2007 June 15;7(1):16. DOI: https://doi.org/10.1186/1472-6947-7-16.
30. Richardson WS, Wilson MC, Nishikawa J, Hayward RS. The well-built clinical question: a key to evidence-based decisions. ACP J Club. 1995;123(3):A12-13.
31. American Diabetes Association Professional Practice Committee for Diabetes. 2. Diagnosis and Classification of Diabetes: Standards of Care in Diabetes-2026. Diabetes Care. 2026 Jan 1;49(Suppl 1):S27-S49. DOI: https://doi.org/10.2337/dc26-S002.
32. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009 Apr;42(2):377–81. DOI: https://doi.org/10.1016/j.jbi.2008.08.010.
33. Harris PA, Taylor R, Minor BL, Elliott V, Fernandez M, O’Neal L, McLeod L, Delacqua G, Delacqua F, Kirby J, Duda SN, REDCap Consortium. The REDCap consortium: building an international community of software platform partners. J Biomed Inform. 2019 July;95:103208. DOI: https://doi.org/10.1016/j.jbi.2019.103208.
34. Tao D, Kochendorfer KM, Griffin T, McCrary Q, Gautam A, Labib BSR, Arvan M, Flynn J, Jiang K. “Can ChatGPT answer patient’s questions?”: a preliminary analysis. Stud Health Technol Inform. 2025 Aug 7;329:1586–7. DOI: https://doi.org/10.3233/shti251114.
35. Fox ZE, Williams AM, Blasingame MN, Koonce TY, Kusnoor SV, Su J, Lee P, Epelbaum MI, Naylor HM, DesAutels SJ, Frakes ET, Giuse NB. Why equating all evidence searches to systematic reviews defies their role in information seeking. J Med Libr Assoc. 2019 Oct 1;107(4):613–7. DOI: https://doi.org/10.5195/jmla.2019.707.
36. Lin CR, Chen YJ, Tsai PA, Hsieh WY, Tsai SHL, Fu TS, Lai PL, Chen JY. Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. Arch Osteoporos. 2025 Sept 8;20(1):120. DOI: https://doi.org/10.1007/s11657-025-01587-4.
37. Flaharty KA, Hu P, Hanchard SL, Ripper ME, Duong D, Waikel RL, Solomon BD. Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions. Am J Hum Genet. 2024 Sept 5;111(9):1819–33. DOI: https://doi.org/10.1016/j.ajhg.2024.07.011.
38. Hripcsak G, Wilcox A. Reference standards, judges, and comparison subjects. J Am Med Inform Assoc. 2002;9(1):1–15. DOI: https://doi.org/10.1136/jamia.2002.0090001.
39. Bockting CL, van Dis EAM, van Rooij R, Zuidema W, Bollen J. Living guidelines for generative AI - why scientists must oversee its use. Nature. 2023 Oct;622(7984):693–6. DOI: https://doi.org/10.1038/d41586-023-03266-1.
40. Cardero R, Sarro E. AI fluency as an essential element towards a smarter workforce [Internet]. HRD; 30 Aug 2025 [cited 23 Sept 2025]. https://www.hrdconnect.com/2025/08/30/ai-fluency-as-an-essential-element-towards-a-smarter-workforce/.
41. Robinson K, Bontekoe K, Muellenbach J. Integrating PICO principles into generative artificial intelligence prompt engineering to enhance information retrieval for medical librarians. J Med Libr Assoc. 2025 Apr 18;113(2):184–8. DOI: https://doi.org/10.5195/jmla.2025.2022.
42. Huang X, Lin J, Demner-Fushman D. Evaluation of PICO as a knowledge representation for clinical questions. AMIA Annu Symp Proc AMIA Symp. 2006;2006:359–63.
43. Stanford Center for Research on Foundation Models. MedHELM - Holistic Evaluation of Language Models (HELM) [Internet]. Stanford University; 2 Jun 2025 [cited 23 Sept 2025]. https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard.
44. Bean AM, Payne R, Parsons G, Kirk HR, Ciro J, Mosquera R, Monsalve SH, Ekanayaka AS, Tarassenko L, Rocher L, Mahdi A. Clinical knowledge in LLMs does not translate to human interactions. arXiv:2504.18919 [Preprint]. arXiv;2025 Apr [cited 24 Sept 2025]. Available from: http://arxiv.org/abs/2504.18919.
45. Rebitschek FG, Carella A, Kohlrausch-Pazin S, Zitzmann M, Steckelberg A, Wilhelm C. Evaluating evidence-based health information from generative AI using a cross-sectional study with laypeople seeking screening information. NPJ Digit Med. 2025 June 9;8(1):343. DOI: https://doi.org/10.1038/s41746-025-01752-6.
Published
Versions
- 2026-04-13 (2)
- 2026-04-13 (1)
License
Copyright (c) 2025 Mallory N. Blasingame, Taneya Y. Koonce, Annette M. Williams, Jing Su, Dario A. Giuse, Poppy A. Krump, Nunzia B. Giuse

This work is licensed under a Creative Commons Attribution 4.0 International License.
