As AI backlash builds across industries, a medical imaging study published last year warrants a second look from healthcare innovators. Researchers led by Robert Kaczmarczyk of the Technical University of Munich, with corresponding author Theresa Isabelle Wilhelm of the University of Freiburg and colleagues, evaluated how AI performs on the New England Journal of Medicine's Image Challenge, a set of intentionally difficult educational cases on which individual physicians average only 49.4% accuracy. Anthropic's Claude 3 family achieved 58.8-59.8% accuracy on these cases, beating the individual participants' average by roughly ten percentage points. Published in npj Digital Medicine, the study reveals both the promise and the current limitations of AI-assisted diagnosis in challenging medical scenarios.
Key Points
- The Claude 3 family achieved 58.8-59.8% accuracy on 945 medical image cases, surpassing the 49.4% average accuracy of individual participants by roughly ten percentage points.
- The study analyzed responses from over 85 million human votes on New England Journal of Medicine image challenges.
- GPT-4 Vision showed concerning selectivity, refusing to answer 24% of medical questions due to overly restrictive safety filters.
- Collective human intelligence still dominated, with the majority vote achieving 90.8% accuracy across all cases (see the scoring sketch after this list).
- Models lack transparency about training data, raising questions about whether test images were previously seen during training.
- The authors declared no competing interests.
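The collective-intelligence figure reflects a simple aggregation: for each case, readers' individual answer choices are tallied and the most-voted option is scored against the correct diagnosis. Below is a minimal sketch of that majority-vote scoring, assuming an illustrative per-case data layout rather than the study's actual pipeline.

```python
from collections import Counter

def majority_vote_accuracy(cases):
    """Score the crowd: per case, take the most-voted option as the
    collective answer and compare it with the correct label.

    `cases` is assumed to be a list of dicts such as
    {"votes": {"A": 12000, "B": 43000, ...}, "answer": "B"} --
    an illustrative format, not the study's actual data schema.
    """
    correct = 0
    for case in cases:
        crowd_answer, _ = Counter(case["votes"]).most_common(1)[0]
        if crowd_answer == case["answer"]:
            correct += 1
    return correct / len(cases)

# Toy example: the crowd's plurality pick is right in 1 of 2 cases.
demo = [
    {"votes": {"A": 10, "B": 30, "C": 5}, "answer": "B"},
    {"votes": {"A": 25, "B": 20, "C": 15}, "answer": "C"},
]
print(f"majority-vote accuracy: {majority_vote_accuracy(demo):.1%}")
```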
Diagnostic assistance technology shows potential for augmenting physician expertise, particularly in resource-limited settings. Yet with significant implementation costs and ongoing concerns about AI hallucinations in medical contexts, widespread adoption will require both technological refinement and careful regulatory navigation.
The Data
- Claude 3 models correctly diagnosed 556-565 out of 945 cases (58.8-59.8%) compared to 467 correct answers (49.4%) from individual physicians on average.
- Open-source models such as CogVLM and LLaVA scored between 30% and 47% accuracy, significantly underperforming the proprietary systems.
- GPT-4V demonstrated a statistically significant bias toward easier cases (p = 0.033): the questions it agreed to answer had an average human success rate of 50.0%, while the harder ones it declined averaged 47.6% (a sketch of this kind of selectivity check follows this list).
- The technology struggles with complex cases where human experts excel, and no model approached the 90.8% accuracy of collective human decision-making.
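How might such a selectivity check be run? One plausible approach, sketched below, compares the per-question human success rates of the items a model answered against those it declined, using a Mann-Whitney U test. The study's exact statistical procedure is not reproduced here, so the helper name, data layout, and choice of test are illustrative assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def refusal_difficulty_test(human_accuracy, answered_mask):
    """Check whether declined questions were systematically harder.

    human_accuracy : per-question fraction of human voters who answered correctly
    answered_mask  : True where the model gave an answer, False where it refused

    Returns the group means and a two-sided Mann-Whitney U p-value. This is an
    illustrative choice of test; the paper's exact procedure may differ.
    """
    human_accuracy = np.asarray(human_accuracy, dtype=float)
    answered_mask = np.asarray(answered_mask, dtype=bool)
    answered = human_accuracy[answered_mask]
    refused = human_accuracy[~answered_mask]
    stat, p_value = mannwhitneyu(answered, refused, alternative="two-sided")
    return answered.mean(), refused.mean(), p_value

# Toy example with synthetic difficulty scores (not the study's data).
rng = np.random.default_rng(0)
acc = rng.uniform(0.2, 0.8, size=945)   # per-question human accuracy
refused = rng.random(945) < 0.24        # roughly 24% refusals, as reported for GPT-4V
print(refusal_difficulty_test(acc, answered_mask=~refused))
```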
The researchers note several important limitations that contextualize these findings:
- The evaluated models are general-purpose AI systems, not custom-designed for medical tasks.
- Uncertainty exists about whether test images appeared in models’ training data, which could inflate performance metrics (a concern the authors term “dataset contamination”).
- The multiple-choice format of the NEJM challenges may not fully capture the complexities encountered in real-world clinical settings.
- Each model was tested with a single configuration without parameter adjustments to evaluate base capabilities.
- These results come from educational challenges designed to be difficult, not from actual clinical scenarios.
The study evaluated nine multimodal AI models on 945 NEJM Image Challenge cases published through December 7, 2023. Researchers used a standardized prompt template asking models to act as expert physicians. The analysis included both proprietary models (Claude 3 family, GPT-4 Vision, Gemini) and open-source alternatives (CogVLM, LLaVA, InternVL). Testing was primarily conducted in December 2023, with some additional models evaluated in May 2024.
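For readers curious what such a benchmark loop looks like in code, the sketch below shows one way to run multiple-choice image cases through a vision-language model and tally accuracy. The prompt wording, the query_model helper, and the case fields are hypothetical stand-ins, not the authors' actual prompt or pipeline.

```python
import base64
from pathlib import Path

# Hypothetical stand-in for whichever vision-language model API is under test;
# the study queried proprietary and open-source models through their own interfaces.
def query_model(prompt: str, image_b64: str) -> str:
    raise NotImplementedError("plug in the model client of your choice")

PROMPT_TEMPLATE = (
    "You are an expert physician. Look at the attached medical image and "
    "choose the single best answer.\n"
    "Question: {question}\n"
    "Options: {options}\n"
    "Reply with the letter of your choice only."
)  # illustrative wording; the study's standardized prompt is not reproduced here

def evaluate(cases):
    """Run each multiple-choice case through the model and report accuracy."""
    correct = 0
    for case in cases:  # each case: {"image": path, "question": ..., "options": [...], "answer": "B"}
        image_b64 = base64.b64encode(Path(case["image"]).read_bytes()).decode()
        prompt = PROMPT_TEMPLATE.format(
            question=case["question"],
            options=", ".join(f"{chr(65 + i)}) {o}" for i, o in enumerate(case["options"])),
        )
        reply = query_model(prompt, image_b64).strip().upper()
        correct += reply[:1] == case["answer"]
    return correct / len(cases)
```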
Industry Context
The future of AI in medicine depends on collaborative efforts to enhance its reliability and ethical application, with the goal of complementing—rather than replacing—human expertise.
Robert Kaczmarczyk and colleagues
The findings position Anthropic's Claude 3 family as a strong performer in this specific evaluation of multimodal AI on medical educational challenges, outperforming OpenAI's GPT-4 Vision and Google's Gemini models in diagnostic accuracy on the NEJM dataset. The results could also influence discussions in the medical imaging AI sector, where specialized diagnostic tools from companies like Aidoc, Viz.ai, and Zebra Medical Vision currently compete for hospital contracts.
However, the path to clinical implementation faces substantial hurdles. The EU AI Act’s recent passage creates stringent requirements for high-risk medical AI systems, demanding transparency that proprietary models currently lack. The study’s authors note that open-source models may have advantages in meeting transparency requirements, though they performed less well in this evaluation.
The authors emphasize that clinical trials, specialized medical AI models, and greater transparency in training data are essential next steps before clinical deployment. They stress the need to move beyond multiple-choice evaluations to testing formats that better reflect real-world clinical complexity. While this research demonstrates promising AI capabilities in medical image interpretation, substantial collaborative work between developers, medical professionals, and regulators remains necessary for safe clinical integration.
The article, “Evaluating multimodal AI in medical diagnostics,” was published in npj Digital Medicine, August 2024 (DOI: 10.1038/s41746-024-01208-3).