Despite increasing use of artificial intelligence in healthcare by patients and providers alike, a new study from Mass General Brigham found that publicly available generative AI models often fail to properly navigate diagnostic situations.
The study, published April 13 in JAMA Network Open, evaluated 21 general-purpose large language models (LLMs) on 29 standardized clinical cases from January to December 2025. Models received sequential case transcripts that “preserved clinical context and maintained continuity” throughout the clinical reasoning process.
Medical student evaluators then scored the models’ outputs at each stage against the MSD Manual. Researchers also developed a new measure, dubbed the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM), to determine accuracy across five clinical reasoning domains.
Among the LLMs tested by Mass General Brigham’s MESH Incubator researchers were GPT-5, Gemini 3.0 Flash and Grok 4.
While all of the LLMs reached the correct final diagnosis more than 90% of the time, researchers found the models “performed poorly in generating differential diagnoses and navigating uncertainty relative to other reasoning stages.” All models failed to produce an appropriate differential diagnosis more than 80% of the time.
"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said Arya Rao, lead author, MESH researcher and M.D.-Ph.D student at Harvard Medical School, in a statement.
MESH Incubator Executive Director Marc Succi, M.D., was one of the study’s corresponding authors. Succi said in a statement that off-the-shelf LLMs “are not ready for unsupervised clinical-grade deployment” despite continual improvements.
“Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate,” Succi said.
The new study builds on previous work from Succi and the MESH group. An August 2023 evaluation of ChatGPT 3.5’s clinical abilities found the chatbot was about 72% accurate in overall clinical decision-making.
Researchers in the present study said most models demonstrated improved accuracy when provided with lab results and imaging in addition to text, with the most recently released models outperforming older ones.
Noted limitations included disabled web search and reasoning features, the researchers’ inability to fully exclude models’ prior exposure to the standardized cases, and an evaluation that did not incorporate model augmentations.
The study emphasized the potential of LLMs to “augment—not replace—physician reasoning.”
“The consistent gap between differential diagnosis and final diagnosis highlights how differently these systems process information compared with physicians,” researchers wrote. “Clinicians preserve uncertainty and iteratively refine differential diagnoses, whereas LLMs collapse prematurely onto single answers, a limitation that persists across model generations.”