



Support or replacement of doctors?

A new study finds that large language models outperform doctors in diagnostic accuracy, but require strategic integration to improve clinical decision-making without replacing human expertise.

Study: Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. Image Credit: Shutterstock AI/Shutterstock.com

In a recent study published in JAMA Network Open, researchers examined whether large language models (LLMs) could improve physicians’ diagnostic reasoning compared with standard diagnostic resources. They found that the LLM alone performed better than groups of physicians using the LLM to diagnose cases.

How can artificial intelligence improve clinical diagnostics?

Diagnostic errors, which can arise from systemic and cognitive problems, can cause significant harm to patients. Thus, improving diagnostic accuracy requires methods to address the cognitive challenges that are part of clinical reasoning. However, common methods such as reflective practices, educational programs, and decision support tools have not effectively improved diagnostic accuracy.

Recent advances in artificial intelligence, particularly LLMs, offer promising support in simulating human-like reasoning and responses. LLMs can also handle complex medical cases and assist in clinical decision-making, while interacting empathetically with the user.

The current use of LLMs in healthcare largely complements, rather than replaces, human expertise. Given the limited training and integration healthcare professionals have received on using LLMs in clinical settings, it is crucial to understand how their use affects patient care.

About the study

In the current study, researchers used a single-blind randomized design to assess physicians’ diagnostic reasoning skills using either LLMs or conventional resources. Physicians working in family, emergency, or internal medicine were recruited for the study, with all sessions conducted in person or remotely.

Physicians were given one hour to work through six moderately complex clinical cases presented in a survey tool. Participants in the intervention group had access to ChatGPT Plus (GPT-4), whereas participants in the control group used only conventional resources.

Clinical cases included detailed patient histories, examination findings, and test results. Case review and selection followed strict criteria involving four physicians; the selected cases covered a wide range of medical conditions while excluding overly simple and extremely rare cases.

Structured reflection was included as a conventional assessment tool. Participants were required to list their top differential diagnoses, explain the case factors favoring and opposing each, choose the most likely diagnosis, and suggest further steps. Responses were scored on the accuracy of the final diagnosis as well as the quality of the diagnostic reasoning.

The standalone diagnostic performance of the LLM was assessed using standardized prompts, with each case run three times for consistency. Responses were then scored, with points awarded for correct reasoning and plausible diagnoses.
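For illustration only, the sketch below shows one way such a repeated, standardized-prompt evaluation could be scripted. The prompt wording, model name, and OpenAI client usage are assumptions for the example, not the authors’ actual protocol.

```python
# Hypothetical sketch: repeated standardized prompting of an LLM for each case.
# Prompt text, model name, and case data are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STANDARD_PROMPT = (
    "You are assisting with a diagnostic exercise. Read the case below, "
    "list a differential diagnosis with supporting and opposing findings, "
    "state the most likely diagnosis, and suggest next evaluation steps.\n\n"
    "Case:\n{case_text}"
)

def query_case(case_text: str, repeats: int = 3) -> list[str]:
    """Send the same standardized prompt several times to check consistency."""
    responses = []
    for _ in range(repeats):
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": STANDARD_PROMPT.format(case_text=case_text)}],
        )
        responses.append(completion.choices[0].message.content)
    return responses
```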

Statistical analyses using mixed-effects models were performed to account for within-participant variability, while linear and logistic models were applied to temporal measures and diagnostic performance.
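For readers unfamiliar with this class of analysis, a minimal mixed-effects sketch in Python (using statsmodels) is shown below; the column names and data layout are illustrative assumptions, not the study’s actual analysis code.

```python
# Hypothetical sketch: mixed-effects model with a random intercept per participant.
# Column names (score, group, participant) are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("case_scores.csv")  # one row per participant-case score

# Fixed effect of study arm (LLM vs. conventional resources);
# random intercept by participant absorbs within-participant variability.
model = smf.mixedlm("score ~ group", data=df, groups="participant")
result = model.fit()
print(result.summary())
```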

Study results

The use of LLMs by physicians did not improve diagnostic reasoning for difficult cases compared to the use of conventional resources by physicians. However, LLMs alone performed significantly better than physicians in diagnosing cases.

These results were consistent regardless of physicians’ experience level, suggesting that simply providing access to LLMs was not likely to improve diagnostic reasoning.

No significant differences in case resolution were observed between the groups. However, further studies with larger sample sizes are needed to determine whether LLM use improves efficiency.

Standalone LLM performance was better than that of both human groups, with results similar to those published in comparable studies of other LLM technologies. The superior standalone performance of the LLM is attributed to its sensitivity to prompt formulation, which highlights the importance of prompting strategies for maximizing the utility of LLMs.

Conclusions

LLMs show immense promise for supporting effective diagnostic reasoning. Despite the successful diagnoses provided by the LLM in the present study, these results should not be interpreted as indicating that LLMs can provide diagnoses without clinician oversight.

As AI research advances and approaches clinical integration, it will become even more important to reliably measure diagnostic performance using the most clinically realistic evaluation methods and relevant metrics.

Integrating LLMs into clinical practice requires effective strategies for designing structured prompts and for training physicians in their use, which could optimize the performance of physician-LLM collaboration in diagnosis (a rough sketch of such a prompt follows below). Nevertheless, using LLMs to improve diagnostic reasoning means treating these tools as complements to, rather than replacements for, physician expertise in clinical decision-making.
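As one possible way to operationalize such structured prompting, the sketch below assembles a physician-facing diagnostic prompt from case details. The field names and wording are assumptions for illustration, not a validated clinical template.

```python
# Hypothetical sketch: building a structured diagnostic prompt from case details.
# Field names and wording are illustrative assumptions.
def build_structured_prompt(history: str, exam: str, tests: str) -> str:
    return (
        "Act as a diagnostic aid for a licensed physician.\n"
        f"History: {history}\n"
        f"Examination: {exam}\n"
        f"Test results: {tests}\n\n"
        "1. List the top three differential diagnoses.\n"
        "2. For each, note findings that support or oppose it.\n"
        "3. State the most likely diagnosis.\n"
        "4. Recommend the next diagnostic steps.\n"
        "The physician retains final responsibility for all decisions."
    )
```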

Journal reference:

  • Goh, E., Gallo, R., Hom, J., et al. (2024). Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open, 7(10): e2440969. doi:10.1001/jamanetworkopen.2024.40969.