Atropos compares how well leading LLMs generate clinical evidence

A new study released by healthcare AI company Atropos found that general-purpose LLMs like ChatGPT, Claude and Gemini are not fit for use in clinical decision-making.

Atropos conducted a study of leading large language models (LLMs), including general-purpose LLMs, a model built specifically for healthcare and its own model, ChatRWD beta. The models were asked to answer 50 healthcare questions that were either submitted by clinicians seeking more evidence for clinical decisions or inspired by clinician questions.

Nine independent clinicians rated the responses of the five large language models on relevance, reliability and how actionable each model’s response was.

The study, posted to the open-access preprint server arXiv, found that general-purpose LLMs like ChatGPT provided relevant information for healthcare professionals’ questions only 2%–10% of the time. The language model built specifically for healthcare performed better, pulling relevant insights 24% of the time. Atropos’ own LLM, ChatRWD, performed the best, pulling relevant data 58% of the time.

Atropos’ ChatRWD queries the 160 million de-identified patient records available to it and generates an answer to the question based on that data. While other LLMs were able to answer novel questions only 0%–9% of the time, ChatRWD was able to answer them 65% of the time, the study found.

“In scenarios where no published literature has existed before, it was actually able to answer those kinds of questions, as opposed to the other LLMs, which said, 'Hey, no reliable evidence actually existed',” Saurabh Gombar, chief medical officer of Atropos, explained to Fierce Healthcare.

Gombar clarified that there will always be a need for randomized controlled trials (RCTs) and for academics to publish medical literature. He considers multi-center RCTs to be a step above ChatRWD’s real-world data analysis.

But gaps exist in medical literature, and ChatRWD can help fill those gaps. 

“One of the large challenges that we face in medicine is that every time we see a patient ideally we want this mountain of evidence behind us to say ‘For this patient, I'm going to do this treatment and that treatment is going to lead to the best results for my patients,’” Gombar said. “It just turns out that mountain of evidence is frequently a molehill, and one of the reasons is these large randomized controlled trials, large prospective trials, they exclude patients with multiple comorbid conditions.”

The off-the-shelf large language models tested included OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude, which reliably answered healthcare professionals’ questions 2%–10% of the time. Gombar said the models would directly answer questions, but some of the studies they cited were hallucinated. The LLMs use the entirety of the internet as training data, including Wikipedia and social media, Gombar said, which can produce unreliable information.

OpenEvidence, by contrast, was trained only on peer-reviewed medical literature. Atropos’ study found the model produced reliable and relevant information for 24% of clinicians’ questions. Where a topic had a solid evidence base, the model answered questions reliably 100% of the time, Gombar said.

To be able to answer 100% of medical questions, Gombar said, models would need more data, such as genetic information and social determinants of health data.

He pointed to companies like 23andMe that are already collecting genetic information, as well as institutions that profile a patient’s ability to metabolize drugs. Currently, this data is siloed and does not make its way into a patient’s medical record. Gombar said interoperability and data-sharing incentives would have to change to pull that data together.