John Halamka on the risks and benefits of clinical LLMs
ORLANDO – At HIMSS24 on Tuesday, Dr. John Halamka, president of Mayo Clinic Platform, offered a frank discussion about the substantial potential benefits – and very real potential for harm – in both predictive and generative artificial intelligence used in clinical settings.
Healthcare AI has a credibility problem, he said, mostly because the models so often lack transparency and accountability.
"Do you have any idea what training data was used on the algorithm, predictive or generative, you're using now?" Halamka asked. "Is the result of that predictive algorithm consistent and reliable? Has it been tested in a clinical trial?"
The goal, he said, is to figure out some strategies so "the AI future we all want is as safe as we all need."
It starts with good data, of course. And that's easier said than done.
"All algorithms are trained on data," said Halamka. "And the data that we use must be curated, normalized. We must understand who gathered it and for what purpose – that part is actually pretty tough."
For instance, "I don't know if any of you have actually studied the data integrity of your electronic health record systems, and your databases and your institutions, but you will actually find things like social determinants of health are poorly gathered, poorly representative," he explained. "They're sparse data, and they may not actually reflect reality. So if you use social determinants of health for any of these algorithms, you're very likely to get a highly biased result."
More questions to be answered: "Who is presenting that data to you? Your providers? Your patients? Is it coming from telemetry? Is it coming from automated systems that extract metadata from images?"
Once those questions are answered satisfactorily, and you've made sure the data has been gathered in a comprehensive enough fashion to develop the algorithm you want, then it's just a question of identifying potential biases and mitigating them. Easy enough, right?
"In the dataset that you have, what are the multimodal data elements? Just patient registration is probably not sufficient to create an AI model. Do you have such things as text, the notes, the history and physical [exam], the operative note, the diagnostic information? Do you have images? Do you have telemetry? Do you have genomics? Digital pathology? That is going to give you a sense of data depth – multiple different kinds of data, which are probably going to be used increasingly as we develop different algorithms that look beyond just structured and unstructured data."
Then it's time to think about data breadth. "How many patients do you have? I talked to several colleagues internationally that say, well, we have a registry of 5,000 patients, and we're going to develop AI on that registry. Well, 5,000 is probably not breadth enough to give you a highly resilient model."
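A back-of-envelope calculation shows why 5,000 patients gets thin quickly: once you slice the cohort into subgroups, the statistical uncertainty on any performance metric balloons. The numbers below are illustrative, using a standard normal-approximation confidence interval:

```python
import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation half-width for a proportion-style metric."""
    return z * math.sqrt(p * (1 - p) / n)

# A metric of 0.85 measured on the full registry vs. a 10% and a 1% subgroup.
for n in (5000, 500, 50):
    print(f"n={n}: 0.85 +/- {ci_halfwidth(0.85, n):.3f}")
```

At n=50, which is all a 1% subgroup of that registry gives you, the metric carries roughly a ±0.10 uncertainty, which is hardly the basis for a resilient model.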
And what about "heterogeneity or spread?" Halamka asked. "Mayo has 11.2 million patients in Arizona, Florida, Minnesota and internationally. But does it offer representative data for France, or for a Nordic population?"
As he sees it, "any dataset from any one institution is probably going to lack the spread to create algorithms that can be globally applied."
In fact, you could probably argue that no one can create an unbiased algorithm in one geography that will work seamlessly in another.
What that implies, he said, is that "you need a global network of federated participants that will help with model creation and model testing and local tuning if we're going to deliver the AI result we want on a global basis."
On that front, one of the biggest challenges is that "not every country on the planet has fully digitized records," said Halamka, who was recently in Davos, Switzerland for the World Economic Forum.
"Why haven't we created an amazing AI model in Switzerland?" he asked. "Well, Switzerland has extremely good chocolate – and extremely bad electronic health records. And about 90% of the data of Switzerland is on paper."
But even with good digitized data, and even after accounting for that data's depth, breadth and spread, there are still other questions to consider. For instance, what data should be included in the model?
"If you want a fair, appropriate, valid, effective and safe algorithm, should you use race ethnicity as an input to your AI model? The answer is to be really careful with doing that, because it may very well bias the model in ways you don't want," said Halamka.
"If there was some sort of biological reason to have race ethnicity as a data element, OK, maybe it's helpful. But if it's really not related to a disease state or an outcome you're predicting, you're going to find – and I'm sure you've all read the literature about overtreatment, undertreatment, overdiagnosis – these kinds of problems. So you have to be very careful when you decide to build the model, what data to include."
Even more steps: "Then, once you have the model, you need to test it on data that's not the development set, and that may be a segregated data set in your organization, or maybe another organization in your region or around the world. And the question I would ask you all is, what do you measure? How do you evaluate a model to make sure that it is fair? What does it mean to be fair?"
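One common operationalization of those questions, sketched below: hold out a test set the model never saw, then compare standard metrics across demographic subgroups rather than reporting a single overall number. The column names here are illustrative.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Held-out predictions, never used in development
# (illustrative columns: y_true, y_pred, subgroup).
test = pd.read_csv("holdout_predictions.csv")

rows = []
for group, g in test.groupby("subgroup"):
    rows.append({
        "subgroup": group,
        "n": len(g),
        "sensitivity": recall_score(g["y_true"], g["y_pred"]),
        "ppv": precision_score(g["y_true"], g["y_pred"]),
    })

report = pd.DataFrame(rows)
print(report)
# A large spread across subgroups is one operational definition of "unfair" --
# though which gaps matter, and how large is too large, is a community decision.
print("sensitivity gap:", report["sensitivity"].max() - report["sensitivity"].min())
```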
Halamka has been working for some time with the Coalition for Health AI, which was founded with the idea that, "if we're going to define what it means to be fair, or effective, or safe, that we're going to have to do it as a community."
CHAI started with just six organizations. Today, it has 1,500 members from around the world, including all the big tech organizations, academic medical centers, regional healthcare systems, payers, pharma and government.
"You now have a public private organization capable of working as a community to define what it means to be fair, how you should measure what is a testing and evaluation framework, so we can create data cards, what data went into the system and model cards, how do they perform?"
It's a fact that every algorithm will have some sort of inherent bias, said Halamka.
That's why "Mayo has an assurance lab, and we test commercial algorithms and self-developed algorithms," he said. "And what you do is you identify the bias and then you mitigate it. It can be mitigated by returning the algorithm to different kinds of data, or just an understanding that the algorithm can't be completely fair for all patients. You just have to be exceedingly careful where and how you use it.
"For example, Mayo has a wonderful cardiology algorithm that will predict cardiac mortality, and it has incredible predictive, positive predictive value for a body mass index that is low and a really not good performance for a body mass index that is high. So is it ethical to use that algorithm? Well, yes, on people whose body mass index is low, and you just need to understand that bias and use it appropriately."
Halamka noted that the Coalition for Health AI has created an extensive series of metrics and artifacts and processes – available at CoalitionforHealthAI.org. "They're all for free. They're international. They're for download."
Over the next few months, CHAI "will be turning its attention to a lot of generative AI topics," he said. "Because generative AI evaluation is harder."
With predictive models, "I can understand what data went in, what data comes out, how it performs against ground truth. Did you have the diagnosis or not? Was the recommendation used or helpful?"
With generative AI, "it may be a completely well-developed technology, but based on the prompt you give it, the answer could either be accurate or kill the patient."
Halamka offered a real example.
"We took a New England Journal of Medicine CPC case and gave it to a commercial narrative AI product. The case said the following: The patient is a 59-year-old with crushing, substantial chest pain, shortness of breath – and left leg radiation.
"Now, for the clinicians in the room, you know that left leg radiation is kind of odd. But remember, our generative AI systems are trained to look at language. And, yeah, they've seen that radiation thing on chest pain cases a thousand times.
"So ask the following question on ChatGPT or Anthropic or whatever it is you're using: What is the diagnosis? The diagnosis came back: 'This patient is having myocardial infarction. Anticoagulate them immediately.'
"But then ask a different question: 'What diagnosis shouldn't I miss?'"
To that query, the AI responded: "'Oh, don't miss dissecting aortic aneurysm and, of course, left leg pain,'" said Halamka. "In this case, this was an aortic aneurysm – for which anticoagulation would have instantly killed the patient.
"So there you go. If you have a product, depending on the question you ask, it either gives you a wonderful bit of guidance or kills the patient. That is not what I would call a highly reliable product. So you have to be exceedingly careful."
At the Mayo Clinic, "we've done a lot of derisking," he said. "We've figured out how to de-identify data and how to keep it safe, how to generate models, how to build an international coalition of organizations, how to do validation, how to do deployment."
Not every health system is as advanced and well-resourced as Mayo, of course.
"But my hope is, as all of you are on your AI journey – predictive and generative – that you can take some of the lessons that we've learned, take some of the artifacts freely available from the Coalition for Health AI, and build a virtuous life cycle in your own organization, so that we'll get the benefits of all this AI we need while doing no patient harm," he said.
Mike Miliard is executive editor of Healthcare IT News
Email the writer: mike.miliard@himssmedia.com
Healthcare IT News is a HIMSS publication.