OpenAI's general-purpose speech recognition model is flawed, researchers say

The AP reports that OpenAI's Whisper transcription tool is prone to hallucinations, making up sentences and sections of text across millions of recordings. Tens of thousands of transcriptions could be faulty.
By Andrea Fox
09:45 AM


The Associated Press recently reported that it interviewed more than a dozen software engineers, developers and academic researchers who take issue with artificial intelligence developer OpenAI's claim that one of its machine learning tools – used in clinical documentation at many U.S. health systems – has human-like accuracy.

WHY IT MATTERS

Researchers at the University of Michigan and others found that AI hallucinations resulted in erroneous transcripts – sometimes with racial and violent rhetoric, in addition to imagined medical treatments – according to the AP.

Of concern is the widespread uptake of tools built on Whisper – available open source or as an API – which could lead to erroneous patient diagnoses or poor medical decision-making.
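
To illustrate how lightweight such an integration can be, here is a minimal sketch, in Python, of transcribing a recording with the open-source Whisper package. The model size and audio file name are assumptions for illustration only and do not represent any vendor's actual integration.

    # Minimal illustrative sketch of open-source Whisper transcription.
    # The model size ("large-v3") and the audio file name are assumptions.
    import whisper

    model = whisper.load_model("large-v3")          # same model family users cite in hallucination reports
    result = model.transcribe("patient_visit.wav")  # hypothetical recording of a consultation
    print(result["text"])                           # transcript returned as plain text

The transcript comes back as plain text with no built-in indication that any portion of it may have been invented, which is why downstream monitoring matters.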

Hint Health is one clinical technology vendor that added the Whisper API last year, letting doctors record patient consultations within the vendor's app and transcribe them with OpenAI's models.

Meanwhile, more than 30,000 clinicians and 40 health systems, such as Children’s Hospital Los Angeles, use ambient AI from Nabla that incorporates a Whisper-based tool. Nabla said Whisper has been used to transcribe approximately seven million medical visits, according to the report.

A spokesperson for that company cited a blog post published Monday that addresses the specific steps the company takes to ensure its models are appropriately used and monitored.

"Nabla detects incorrectly generated content based on manual edits to the note and plain language feedback," the company said in the blog. "This provides a precise measure of real-world performance and gives us additional inputs to improve models over time."

Of note, Whisper is also integrated into some versions of OpenAI’s flagship chatbot ChatGPT, and is a built-in offering in Oracle and Microsoft’s cloud computing platforms, according to the AP.

Meanwhile, OpenAI warns users that the tool should not be used in "high-risk domains" and recommends in its online disclosures against using Whisper in "decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes."

"Will the next model improve on the issue of large-v3 generating a significant amount of hallucinations?," one user asked on OpenAI's GitHub Whisper discussion board on Tuesday. A question that was unanswered at press time.

"This seems solvable if the company is willing to prioritize it,” William Saunders, a San Francisco-based research engineer who left OpenAI earlier this year, told the AP. "It’s problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems."

Notably, OpenAI recently posted a job opening for a health AI research scientist, whose chief responsibilities would be to "design and apply practical and scalable methods to improve safety and reliability of our models" and "evaluate methods using health-related data, ensuring models provide accurate, reliable and trustworthy information."

THE LARGER TREND

In September, Texas Attorney General Ken Paxton announced a settlement with Dallas-based artificial intelligence developer Pieces Technologies over allegations that the company's generative AI tools had put patient safety at risk by overpromising accuracy. That company uses genAI to summarize real-time electronic health record data about patient conditions and treatments.

And a study of LLM accuracy in producing medical notes, conducted by the University of Massachusetts Amherst and Mendel, an AI company focused on hallucination detection, found many errors.

Researchers compared OpenAI's GPT-4o and Meta's Llama-3 and found that, across 50 medical notes, GPT-4o produced 21 summaries with incorrect information and 50 with generalized information, while Llama-3 had 19 errors and 47 generalizations.

ON THE RECORD

"We take this issue seriously and are continually working to improve the accuracy of our models, including reducing hallucinations," a spokesperson for OpenAI told Healthcare IT News by email Tuesday. 

"For Whisper use on our API platform, our usage policies prohibit use in certain high-stakes decision-making contexts, and our model card for open-source use includes recommendations against use in high-risk domains. We thank researchers for sharing their findings."

Andrea Fox is senior editor of Healthcare IT News.
Email: afox@himss.org

Healthcare IT News is a HIMSS Media publication.
