ChatGPT scored 72% in clinical decision accuracy, MGB study shows

The large language model's performance was consistent across both primary and emergency care and across all medical specialties, but it struggled with differential diagnoses, according to new research from Mass General Brigham.
By Andrea Fox

Photo: Leon Neal/Getty

Putting ChatGPT to the test to see if AI can work through an entire clinical encounter with a patient – recommending a diagnostic workup, deciding a course of action and making a final diagnosis – Mass General Brigham researchers have found the large language model to have "impressive accuracy" despite limitations, including possible hallucinations.

WHY IT MATTERS

Researchers from the Innovation in Operations Research Center at MGB tested ChatGPT, a large language model (LLM) artificial intelligence chatbot, on all 36 published clinical vignettes from the Merck Sharp & Dohme clinical manual and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis and management based on patient age, gender and case acuity.

"No real benchmarks exist, but we estimate this performance to be at the level of someone who has just graduated from medical school, such as an intern or resident," Dr. Marc Succi, associate chair of innovation and commercialization and strategic innovation leader at MGB and executive director of its MESH Incubator's Innovation in Operations Research Group, or MESH IO, said in a statement.

The researchers said that ChatGPT achieved an overall accuracy of 71.7% in clinical decision making across all 36 clinical vignettes. ChatGPT came up with possible diagnoses and made final diagnoses and care management decisions.

They measured the popular LLM's accuracy on differential diagnosis, diagnostic testing, final diagnosis and management in a structured blinded process, awarding points for correct answers to the questions posed. Researchers then used linear regression to assess the relationship between ChatGPT's performance and the vignettes' demographic information, according to the study published this past week in the Journal of Medical Internet Research.
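For readers who want a concrete picture of that kind of workflow, here is a minimal Python sketch (requires Python 3.10+) of a scoring-and-regression analysis of the same general shape. The vignette values, point totals and the choice of patient age as the predictor are illustrative assumptions, not the study's actual rubric, data or code; those are described in the JMIR paper.

```python
# Illustrative sketch only: made-up per-vignette scores, not the MGB study's data.
from statistics import linear_regression

# Each record: (patient_age, points_awarded, points_possible) for one vignette.
vignettes = [
    (25, 14, 20),
    (47, 16, 20),
    (63, 13, 20),
    (71, 15, 20),
    (34, 15, 20),
]

# Per-vignette accuracy: share of available points the model earned.
accuracies = [earned / possible for _, earned, possible in vignettes]

# Overall accuracy pooled across all vignettes (the study reports 71.7% over 36).
overall = sum(earned for _, earned, _ in vignettes) / sum(
    possible for _, _, possible in vignettes
)
print(f"Overall accuracy: {overall:.1%}")

# Simple linear regression of accuracy against patient age, analogous to
# checking for a relationship between performance and vignette demographics.
ages = [age for age, _, _ in vignettes]
slope, intercept = linear_regression(ages, accuracies)
print(f"Accuracy is roughly {intercept:.3f} + {slope:.5f} * age")
```

A near-zero slope in such a regression would suggest performance does not vary meaningfully with that demographic variable, which is the kind of consistency the researchers reported across age, gender and care setting.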

ChatGPT performed best at making a final diagnosis, where the AI had 77% accuracy in the study, which was funded in part by the National Institute of General Medical Sciences.

It performed worst at differential diagnosis, where it was only 60% accurate, and it underperformed on clinical management decisions, reaching 68% accuracy based on the clinical data it was given.

This is good news for those who have questioned whether ChatGPT can really outshine doctors' expertise.

"ChatGPT struggled with differential diagnosis, which is the meat and potatoes of medicine when a physician has to figure out what to do," Succi said. "That is important because it tells us where physicians are truly experts and adding the most value – in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed." 

Before tools like ChatGPT can be considered for integration into clinical care, more benchmark research and regulatory guidance are needed, according to MGB. Next, MESH IO is looking at whether AI tools can improve patient care and outcomes in hospitals' resource-constrained areas.

THE LARGER TREND

While most ChatGPT tools created in health tech focus on cutting physician burnout by streamlining documentation tasks or searching for data and answering patient questions, one of the biggest considerations the industry faces with AI is trust, according to Dr. Blackford Middleton, an independent consultant and former chief medical information officer at Stanford Health Care.

To convince clinicians at healthcare provider organizations to trust an AI system their health system wants to implement, transparency is key. So is the ability to provide feedback, "like a post-marketing surveillance of drugs," when AI is involved in decision-making, so that developers can fine-tune systems, Middleton said on HIMSSCast in June.

Knowing what training data and update cycles sit behind the LLM is vital, because clinical decision-making with AI is still a "green" field.

However, he said, "My belief is that we will have – in the healthcare delivery scenario – we will have many systems running concurrently."

ON THE RECORD

"Mass General Brigham sees great promise for LLMs to help improve care delivery and clinician experience," Dr. Adam Landman, chief information officer and senior vice president of digital at MGB and study co-author, said in a statement. 

Andrea Fox is senior editor of Healthcare IT News.
Email: afox@himss.org

Healthcare IT News is a HIMSS Media publication.
