AI chatbots are increasingly proving their worth as diagnostic partners in a variety of fields, particularly healthcare. Researchers at Beth Israel Deaconess Medical Center (BIDMC) compared a chatbot's probabilistic reasoning to that of human clinicians. The findings, published in JAMA Network Open, suggest that artificial intelligence could serve as a useful clinical decision support tool for physicians.
“Humans struggle with probabilistic reasoning, the practice of making decisions based on calculating odds,” said Adam Rodman, MD, an internal medicine physician and investigator in BIDMC’s Department of Medicine. “Probabilistic reasoning is one of several components of making a diagnosis, which is an extremely complex process that employs a wide range of cognitive strategies. We chose to evaluate probabilistic reasoning separately because it is a well-known area where humans could benefit from assistance.”
Rodman and colleagues gave the publicly available large language model (LLM) GPT-4 the same five medical cases used in a previously published national survey of more than 550 practitioners performing probabilistic reasoning, running an identical prompt 100 times to generate a range of responses.
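As a rough sketch of that sampling procedure (not the study's actual code), one could send the same prompt to a chat-completion API many times and collect the spread of answers; the client library, model name, and prompt below are assumptions for illustration only.

```python
# Sketch: repeatedly sample a chat LLM with an identical prompt to
# characterize the spread of its probability estimates.
# Assumes the OpenAI Python client (pip install openai); the prompt
# and model name are placeholders, not the study's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = ("A patient presents with cough and fever. Estimate the "
          "probability (0-100%) that this patient has pneumonia.")

estimates = []
for _ in range(100):  # identical prompt, run 100 times
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    estimates.append(response.choices[0].message.content)
```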
The chatbot, like the practitioners before it, was tasked with estimating the likelihood of a given diagnosis based on a patient's presentation. It then updated its estimates in light of test results such as a chest radiograph for pneumonia, mammography for breast cancer, a stress test for coronary artery disease, and a urine culture for urinary tract infection.
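For readers unfamiliar with this kind of updating, the sketch below shows the textbook Bayesian calculation it corresponds to: turning a pretest probability into a posttest probability via a test's sensitivity and specificity. The numbers are hypothetical illustrations, not figures from the study, and this is the standard formula rather than whatever the chatbot does internally.

```python
def posttest_probability(pretest: float, sensitivity: float,
                         specificity: float, positive: bool) -> float:
    """Update a pretest disease probability after a test result
    using Bayes' theorem, expressed via likelihood ratios."""
    if positive:
        # Positive likelihood ratio: P(T+ | disease) / P(T+ | no disease)
        lr = sensitivity / (1 - specificity)
    else:
        # Negative likelihood ratio: P(T- | disease) / P(T- | no disease)
        lr = (1 - sensitivity) / specificity
    pretest_odds = pretest / (1 - pretest)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# Hypothetical numbers: 30% pretest probability of pneumonia,
# chest radiograph with 80% sensitivity and 90% specificity.
print(posttest_probability(0.30, 0.80, 0.90, positive=True))   # ~0.77
print(posttest_probability(0.30, 0.80, 0.90, positive=False))  # ~0.09
```

The negative branch is where the update matters most: a negative result should pull the probability well below the pretest estimate, which is exactly the step Rodman notes humans tend to get wrong.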
When the test results were positive, the two were roughly even: the chatbot was more accurate than the humans in two cases, similarly accurate in two, and less accurate in one. When the tests came back negative, however, the chatbot shone, making more accurate diagnoses than the humans in all five cases.
“Humans sometimes feel the risk is higher than it is after a negative test result, which can lead to overtreatment, more tests, and too many medications,” Rodman said. He is more interested, however, in how highly skilled physicians’ performance might change when these supportive technologies are available to them in the clinic, a question he and his colleagues are now investigating.
“LLMs have no access to the outside world; they don’t calculate probabilities in the same way that epidemiologists or even poker players do. What they’re doing is very similar to how humans make spot probabilistic decisions,” he said. “But that’s what makes it exciting. Even if they are imperfect, their ease of use and ability to be integrated into clinical workflows could theoretically lead to better decisions by humans. Future research into collective human and artificial intelligence is sorely needed.”