Study: AI can flag cognitive decline in clinical notes nearly on par with humans

A study using agentic artificial intelligence to detect early signs of cognitive decline in unstructured medical records found the technology achieved near-expert performance without any human guidance.

Mass General Brigham researchers built a multi-agent workflow in which five AI agents debate one another, powered by open-weight large language models (LLMs): Meta's Llama and the Llama-derived, medically tuned Med42. The study drew on data from 200 real MGB patients and more than 3,300 clinical notes.
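The article does not include the team's code, but the general pattern of a debate-style screening loop can be sketched. The snippet below is an illustration, not the study's implementation: it assumes models served locally through Ollama, collapses the five-agent design into three illustrative roles, and invents the prompts and consensus rule.

```python
# Minimal sketch of a multi-agent "debate" over one clinical note.
# Assumptions: models served locally via Ollama; three roles instead of the
# study's five; prompts and consensus rule are invented for illustration.
import ollama

ROLES = {
    "pro":   "You argue FOR evidence of early cognitive decline in the note.",
    "con":   "You argue AGAINST evidence of early cognitive decline in the note.",
    "judge": "You weigh both arguments and answer with one word: YES or NO.",
}

def ask(model: str, system: str, user: str) -> str:
    """One chat turn against a locally served model; returns the reply text."""
    resp = ollama.chat(model=model,
                       messages=[{"role": "system", "content": system},
                                 {"role": "user", "content": user}])
    return resp["message"]["content"]

def screen_note(note: str, model: str = "llama3.1") -> str:
    pro = ask(model, ROLES["pro"], note)      # round 1: opposing readings
    con = ask(model, ROLES["con"], note)
    brief = f"NOTE:\n{note}\n\nFOR:\n{pro}\n\nAGAINST:\n{con}"
    return ask(model, ROLES["judge"], brief)  # round 2: final verdict
```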

The AI came close to matching human-guided performance, achieving over 90% of expert-level accuracy without human intervention.

“We basically built a digital clinical team,” Hossein Estiri, Ph.D., a researcher on the study, director of the Clinical Augmented Intelligence research group and associate professor of medicine at Massachusetts General Hospital, told Fierce Healthcare. “And it was much cheaper, because you don’t need a human. It’s completely autonomous.”

The study looked at model performance across a validation dataset, which resembled real-world conditions, and a refinement dataset, which had more balanced training data. Alongside the study's publication, the researchers released Pythia, an open-source tool to help other researchers deploy autonomous prompt optimization for their own AI screening applications.
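Pythia's actual interface is not described in the article, but the core idea of autonomous prompt optimization can be sketched as a greedy search: score a candidate prompt against labeled notes, ask a model to revise it, and keep the revision only if it scores better. The function names and data shapes below are illustrative assumptions, not Pythia's API.

```python
# Hypothetical sketch of autonomous prompt optimization in the spirit the
# article describes. Pythia's real API is not shown here; score_prompt,
# optimize and the callables passed in are illustrative assumptions.
from typing import Callable

def score_prompt(prompt: str, notes: list[str], labels: list[int],
                 classify: Callable[[str, str], int]) -> float:
    """Accuracy of `prompt` on labeled notes; classify(prompt, note) -> 0/1."""
    hits = sum(classify(prompt, note) == label
               for note, label in zip(notes, labels))
    return hits / len(notes)

def optimize(seed: str, notes: list[str], labels: list[int],
             classify: Callable[[str, str], int],
             revise: Callable[[str], str], rounds: int = 10) -> str:
    """Greedy loop: keep an LLM-revised prompt only if it scores better."""
    best, best_score = seed, score_prompt(seed, notes, labels, classify)
    for _ in range(rounds):
        candidate = revise(best)  # e.g., ask an LLM to rewrite the prompt
        score = score_prompt(candidate, notes, labels, classify)
        if score > best_score:
            best, best_score = candidate, score
    return best
```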

Human reviewers initially disagreed with the AI on a number of cases: in the validation dataset, 16 cases appeared to be false negatives, meaning the AI had determined there was no cognitive concern where reviewers believed there was one. Ultimately, independent experts sided with the AI in 44% of those cases, concluding it had correctly ruled out concerns based on the available evidence. This was despite an information disadvantage: the AI worked only from clinical notes, while the human reviewers had access to complete medical records.

The AI struggled where it had only isolated data points lacking clinical context, but it excelled at analyzing comprehensive clinical narratives: history of present illness, exam findings and clinical reasoning.

The open-weight LLMs were chosen because “we wanted to make sure that the system that we build, low-resource healthcare systems are able to implement it,” Estiri explained. The models don’t need specialized processors like Nvidia GPUs to run; a solid Apple or Dell laptop would suffice.
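As a rough illustration of what laptop-class deployment can look like, the open-source llama-cpp-python bindings can run a quantized model on CPU alone. The GGUF file name and quantization level below are placeholders, not the study's actual configuration.

```python
# Rough illustration of CPU-only, laptop-class inference with a quantized
# open-weight model via llama-cpp-python. The model file name and quantization
# level are placeholders (assumptions), not the study's setup.
from llama_cpp import Llama

llm = Llama(model_path="med42-8b.Q4_K_M.gguf", n_ctx=4096)  # CPU by default

out = llm.create_chat_completion(messages=[
    {"role": "user",
     "content": "Does this clinical note suggest early cognitive decline? "
                "Answer YES or NO.\n\n<note text here>"},
])
print(out["choices"][0]["message"]["content"])
```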

Of note, the researchers studied, and included in the findings, where the agentic workflow failed: they found a “significant” drop in sensitivity on the refinement dataset. The hope, Estiri said, is that this will inform others who want to replicate the approach. The AI system is not generalizable out of the box; it needs to be customized to each health system and its demographics.

“We have provided an open-source tool, so basically any hospital system regardless of their resource availability can actually go and do this process on their own data,” Estiri said.

Recent advancements in LLMs have transformed the ability to conduct these sorts of studies and have the potential to revolutionize clinical workflows, explained Lidia Moura, M.D., Ph.D., director of population health for the Department of Neurology at MGB and another author on the study. Moura, who also directs the Center for Healthcare Intelligence at MGB, has previously conducted research on dementia patterns using Medicare claims. She described that experience as “disturbing”: she saw how everything from quality metrics to reimbursement was being determined by measures that didn’t capture reality.

“Cognitive change rarely announces itself in a clean, standardized way,” Moura told Fierce Healthcare. It might show up as a missed appointment or a change in how a patient tells their story. “Yet so many policy and healthcare delivery decisions are still based on measures that assume cognition is neatly captured.”

Relying on humans for screening is also not sustainable because visit times are very limited, physicians are in short supply and the older population is growing rapidly. “Even when you do a 3.5-hour psych evaluation, you might not be certain about what is happening, about what's going on. It’s a very subjective topic,” Moura said.

Pulling from rich, unstructured data like clinical notes yields results with greater sensitivity than relying solely on claims data and simple algorithms, per Moura. But consistently tracking the many complex contextual factors in highly variable free text is challenging even for expert humans. Using LLMs is an obvious solution, she said, “in a way that they are not replacing, but they are aiding, they are supporting, the care delivery.”

“The work reflects a broader institutional commitment to using advanced analytics to support clinicians by making patterns visible that are otherwise difficult to track at scale,” Moura said.