DoD to develop scalable genAI testing datasets
The U.S. Department of Defense's Chief Digital and Artificial Intelligence Office and technology nonprofit Humane Intelligence announced the conclusion of the agency's Crowdsourced Artificial Intelligence Red-Teaming Assurance Program pilot, which focused on testing large language model chatbots used in military medical services.
DoD officials said the findings could ultimately improve military medical care while adhering to required risk management practices for the use of AI.
WHY IT MATTERS
In an announcement Thursday, DoD said the CAIRT program's most recent red-team test involved more than 200 agency clinical providers and healthcare analysts, who compared three LLMs for two prospective use cases: clinical note summarization and a medical advisory chatbot.
Participants surfaced more than 800 potential vulnerabilities and biases in the LLMs being tested to enhance military medical care.
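The announcement does not detail CAIRT's tooling, but the basic mechanics of such a test are straightforward: fan each crowdsourced probe prompt out to every candidate model and log reviewers' findings for later analysis. Below is a minimal Python sketch of that workflow; the query_model call, the model identifiers and the finding categories are all illustrative placeholders, not the actual DoD system.

```python
# Minimal sketch of a crowdsourced red-teaming round: probe prompts are
# fanned out to several candidate LLMs, and reviewer findings are logged.
# All names here (query_model, model IDs, categories) are hypothetical.
from dataclasses import dataclass


@dataclass
class Finding:
    model_id: str
    prompt: str
    response: str
    category: str   # e.g., "bias", "hallucination", "unsafe advice"
    notes: str = ""


def query_model(model_id: str, prompt: str) -> str:
    """Placeholder for a real chat-completion call to the model under test."""
    return f"[{model_id} response to: {prompt!r}]"


def red_team_round(models: list[str], prompts: list[str]) -> list[dict]:
    """Fan each crowdsourced prompt out to every candidate model."""
    return [
        {"model_id": m, "prompt": p, "response": query_model(m, p)}
        for m in models
        for p in prompts
    ]


if __name__ == "__main__":
    models = ["model-a", "model-b", "model-c"]        # three candidate LLMs
    prompts = [
        "Summarize this clinical note: ...",          # note summarization
        "A patient reports chest pain; what next?",   # advisory chatbot
    ]
    transcripts = red_team_round(models, prompts)

    # A human reviewer flags a problematic response as a finding.
    findings = [Finding(**transcripts[0], category="bias",
                        notes="response assumed patient demographics")]
    print(f"{len(transcripts)} transcripts, {len(findings)} finding(s) logged")
```

The value of the crowd in this design is volume: every additional participant multiplies the prompt set, and every transcript is a candidate finding.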
CAIRT aimed to build a community of practice around algorithmic evaluations in collaboration with the Defense Health Agency and the Program Executive Office, Defense Healthcare Management Systems. In 2024, the program also ran a paid AI bias bounty focused on unknown risks in LLMs, beginning with open-source chatbots.
Crowdsourcing casts a wide net that can produce large volumes of data across multiple stakeholders. DoD said the findings from all CAIRT program red-teaming efforts will be crucial to shaping policies and best practices for the responsible use of generative AI.
DoD also said continued testing of LLMs and AI systems through the CAIRT Assurance Program is critical to accelerating AI capabilities and justifying confidence across DoD genAI use cases.
THE LARGER TREND
Trust is essential for clinicians to embrace AI. For genAI to be used in clinical care, LLMs must meet critical performance expectations, assuring providers that the tools are useful, transparent, explainable and secure, as Dr. Sonya Makhni, medical director of applied informatics at Mayo Clinic Platform, told Healthcare IT News recently.
Despite the enormous potential for the positive use of AI in healthcare delivery, "unlocking that is challenging," said Makhni at the HIMSS AI in Healthcare Forum this past September.
Because "assumptions and decisions are made during each step of the AI development life cycle, and if incorrect these assumptions can lead to systematic errors," allowing bias to creep in, Makhni explained when asked about how to deliver the safe use of AI.
"Such errors can skew the end result of an algorithm against a subgroup of patients and ultimately pose risks to healthcare equity," she continued. "This phenomenon has been demonstrated in existing algorithms."
To test performance and eliminate algorithmic bias, clinicians and developers must work together "throughout the AI development life cycle and through solution deployment," Makhni advised.
"Active engagement from both parties is necessary in predicting potential areas of bias and/or suboptimal performance," she added. "This knowledge will help clarify contexts that are better suited to a given AI algorithm and those that perhaps require more monitoring and oversight."
ON THE RECORD
"Since applying GenAI for such purposes within the DoD is in earlier stages of piloting and experimentation, this program acts as an essential pathfinder for generating a mass of testing data, surfacing areas for consideration and validating mitigation options that will shape future research, development and assurance of GenAI systems that may be deployed in the future," said Dr. Matthew Johnson CAIRT program lead, in a Jan. 2 statement about the initiative.
Andrea Fox is senior editor of Healthcare IT News.
Email: afox@himss.org
Healthcare IT News is a HIMSS Media publication.