A new study reveals that artificial intelligence can now diagnose complex medical cases better than human doctors, but using the tool doesn’t necessarily help physicians improve their own results.
For years, the promise of AI in healthcare has been framed as a partnership: a “co-pilot” that handles data so doctors can focus on care. However, a randomized clinical trial published in JAMA Network Open challenges this optimistic view. The researchers found that while ChatGPT alone demonstrated exceptional diagnostic accuracy, doctors who had access to the chatbot performed only marginally better than those who used standard resources.
This finding raises uncomfortable questions about the future of the medical profession. If the “intern” is smarter than the attending physician, how should they work together?
The Study: ChatGPT vs. Physicians
Researchers from Stanford University and other institutions conducted a randomized clinical trial involving 50 licensed physicians. They presented the doctors with varied clinical case vignettes—complex descriptions of patient histories, symptoms, and lab results—and asked them to provide a diagnosis.
The participants were divided into two groups:
- Group A: Used conventional resources (like UpToDate or Google search).
- Group B: Used ChatGPT (GPT-4) along with conventional resources.
The results were striking. ChatGPT, when operating alone, scored significantly higher than both groups of humans.
- ChatGPT alone: Achieved a median diagnostic reasoning score of roughly 90%.
- Physicians with ChatGPT: Scored a median of 76%.
- Physicians without ChatGPT: Scored a median of 74%.
The difference between the two groups of doctors was not statistically significant. In other words, giving a doctor the world’s most powerful diagnostic AI didn’t turn them into a super-diagnostician. It barely moved the needle.
Why doctors struggled to use the AI
Why didn’t the human-AI teams crush the test? The study authors and analysts point to two main human factors: trust and skill.
Anchoring bias
Doctors, like all humans, are subject to “anchoring bias”—the tendency to rely too heavily on the first piece of information they receive or their own initial intuition. In many cases, even when ChatGPT suggested the correct diagnosis, physicians stuck to their original, incorrect conclusion. They treated the AI as a search engine to confirm what they already believed, rather than an expert consultant that might correct them.
The prompt engineering gap
Using a Large Language Model (LLM) effectively requires skill. It is not just a database; it is a reasoning engine that responds to the quality of the question. The study suggests that many physicians did not know how to prompt the AI effectively enough to extract its full diagnostic potential. They may have asked simple look-up questions rather than engaging the model in a step-by-step clinical reasoning dialogue. As other research has shown, people with stronger thinking skills tend to get more out of technology, and that gap becomes critical when the tool is an advanced AI.
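To make the difference concrete, here is a minimal sketch contrasting a "search-bar" question with a structured, step-by-step diagnostic prompt. It assumes the OpenAI Python SDK and a GPT-4 model; the vignette, prompt wording, and model name are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch (not the study's protocol): a vague look-up prompt versus a
# structured clinical-reasoning prompt, sent through the OpenAI Python SDK.
# The vignette text and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = (
    "58-year-old with three weeks of fatigue, low-grade fever, "
    "a new heart murmur, and splinter hemorrhages under the fingernails."
)

# "Search-bar" style: one vague question, likely to get a shallow answer.
naive = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"What could this be? {vignette}"}],
)

# "Consultant" style: ask the model to reason step by step and expose uncertainty.
structured = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are assisting a physician with diagnostic reasoning. "
                       "Work step by step and be explicit about uncertainty.",
        },
        {
            "role": "user",
            "content": (
                f"Case: {vignette}\n\n"
                "1. Give a ranked differential diagnosis with supporting and opposing findings.\n"
                "2. Name the findings that would most change your ranking.\n"
                "3. Suggest next diagnostic tests and what each would rule in or out."
            ),
        },
    ],
)

print(naive.choices[0].message.content)
print(structured.choices[0].message.content)
```

The point is not the specific wording but the structure: the second prompt forces the model to lay out its reasoning so the physician can interrogate it, rather than accept or dismiss a one-line answer.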
The human cost: productivity vs. satisfaction
The rise of capable AI “interns” brings another risk: the erosion of professional satisfaction. The Guardian highlights a separate experiment from MIT involving materials scientists.
In that study, scientists used an AI tool to help discover new materials. The AI was incredibly effective:
- 44% more new materials discovered.
- 39% more patent filings.
However, the human cost was high. The AI took over the creative “idea generation” phase, leaving the highly trained scientists to handle the mundane task of verifying the AI’s suggestions. As a result, 82% of the researchers reported a reduction in job satisfaction. They felt less like inventors and more like cogs in a machine. AI may well speed up scientific research far more than most people expect, but its impact on scientists’ morale is a newer concern.
This mirrors the “Gods, Interns, and Cogs” framework proposed by anthropologist Drew Breunig. While we fear AI “Gods” (superintelligence), we are currently deploying AI “Interns” (LLMs). If professionals are not careful, they risk being demoted to “Cogs”—merely checking the work of a superior digital intellect.
Context: Google’s AMIE and the future
This is not the first time AI has shown diagnostic prowess. Earlier in 2024, Google DeepMind introduced AMIE (Articulate Medical Intelligence Explorer), a research AI system optimized for diagnostic dialogue. In simulated text-based consultations, AMIE outperformed primary care physicians on both diagnostic accuracy and perceived empathy.
The trend is clear. Whether it is Google’s AMIE making better diagnoses than human doctors or OpenAI’s general-purpose ChatGPT, algorithms are rapidly surpassing human benchmarks in cognitive tasks that were once the exclusive domain of highly trained professionals.
What you can do about it
For Patients:
- Don’t ditch your doctor. AI makes mistakes (hallucinations) and cannot perform physical exams or show genuine human care.
- Use AI as a second opinion. If you have a complex condition, summarizing your symptoms for a secure, private AI tool might generate questions you can ask your doctor. “I read that Condition X has these symptoms; could that be a possibility?”
For Professionals:
- Learn to collaborate. The doctors who failed to improve with AI treated it like a search bar. The future belongs to those who treat AI as a reasoning partner—challenging it, asking it to check for blind spots, and knowing when to trust it.
- Protect your joy. To avoid the “cog” trap, professionals must find ways to use AI to eliminate drudgery without surrendering the creative and problem-solving aspects that make work meaningful.
Sources & related information
JAMA Network Open – Large Language Model Influence on Diagnostic Reasoning – 2024
A randomized clinical trial finding that GPT-4 alone outperformed physicians in diagnostic reasoning scores, while physicians using the tool did not significantly improve compared to those without it.
The Guardian – If AI can provide a better diagnosis than a doctor, what’s the prognosis for medics? – 2024
An analysis of recent studies on AI in healthcare, discussing the “Gods, Interns, and Cogs” framework and the impact of AI on professional job satisfaction.
MIT Sloan – Generative AI enhances creativity but reduces job satisfaction – 2024
A working paper detailing an experiment with material scientists where AI boosted productivity and patent filings but led to a significant drop in reported job satisfaction among researchers.
Google Research – AMIE: A research AI system for diagnostic medical reasoning – 2024
Research introducing an AI system trained for diagnostic dialogue that outperformed primary care physicians in simulated text-based consultations.