How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images

ICLR 2026

Guimeng Liu1,*, Tianze Yu1,*, Somayeh Ebrahimkhani1,*, Lin Zhi Zheng Shawn2,3,
Kok Pin Ng2,3,†, Ngai-Man Cheung1,†,‡
1 Singapore University of Technology and Design, Singapore
2 Department of Neurology, National Neuroscience Institute, Singapore
3 Duke-NUS Medical School, Singapore
* Equal first-author contribution. † Corresponding author. ‡ Project lead.

Summary

Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. Our work, for the first time, systematically validates inadequate visual grounding in clinically relevant image regions as a key contributing factor to medical MLLMs' underperformance. We note that this finding is specific to medical image analysis. In contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images.

Visual Grounding Failure in Medical MLLMs

We show that state-of-the-art medical MLLMs often suffer from inadequate visual grounding: they fail to accurately localize and interpret image regions that are clinically relevant to the question (subfig. a). In contrast, when applied to natural images, MLLMs are capable of grounding their predictions in the correct image regions (subfig. b). This limitation contributes to medical MLLMs' suboptimal performance on medical VQA tasks.
Failure modes of Medical MLLMs figure

VGMED Dataset

Existing Med-VQA datasets are not well-suited for systematically studying visual grounding (subfig. a): it's difficult to disentangle whether model errors arise from inadequate semantic grounding (i.e., the model is unable to determine what to look for) or from inadequate visual grounding (i.e., the model is unable to localize the relevant image region even when it knows what to look for). To disentangle visual grounding from semantic grounding, we design VGMED (subfig. b), a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs.

VGMED dataset overview figure

Observations

Medical MLLMs demonstrate suboptimal visual grounding when applied to medical images. Analysis using our proposed VGMED dataset shows that all evaluated medical MLLMs exhibit substantially weaker alignment between their attention distributions and ground-truth annotations on medical images compared to natural scene images (from MS COCO).

This failure mode contributes to their underperformance in zero-shot medical image understanding. Further comparison with LLaVA-v1.5 on natural images reinforces this observation: medical MLLMs show significantly lower alignment with annotated regions, highlighting deficiencies in visual grounding for medical image analysis.
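The exact alignment metric used in the paper is not spelled out on this page; a simple illustrative proxy is the fraction of a model's attention mass that falls inside the annotated ground-truth region. The function below is a minimal sketch of that idea, assuming the attention map and annotation mask share the same spatial grid.

```python
import numpy as np

def attention_alignment(attn_map, gt_mask):
    """Fraction of attention mass falling inside the ground-truth region.

    attn_map : 2-D array of non-negative attention weights over image patches.
    gt_mask  : boolean array of the same shape marking the annotated region.
    """
    attn = attn_map / attn_map.sum()   # normalize to a distribution
    return float(attn[gt_mask].sum())  # attention mass inside the region

# Toy example: all attention mass sits on the top-left quadrant, which
# coincides with the annotated region, so alignment is perfect.
attn = np.zeros((4, 4))
attn[:2, :2] = 1.0
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
print(attention_alignment(attn, mask))  # → 1.0
```

A score near 1 indicates attention concentrated on the clinically relevant region; the observations above correspond to medical MLLMs scoring much lower on medical images than on natural ones.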

Analysis results for visual grounding on VGMED

Method

We propose Visual Grounding Refinement (VGRefine): a two-step inference-time method to improve visual grounding in medical MLLMs.

  • Step I (Attention Triage): We aggregate attention from the model's most visually sensitive heads and suppress low-confidence attention, obtaining a binary mask.
  • Step II (Attention Knockout): We use this mask to refine the model's attention distribution, improving its focus on relevant regions during inference.
VGRefine method overview
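The two steps above can be sketched as follows. This is an illustrative NumPy implementation, not the paper's exact procedure: the head-selection score (total attention mass) and the quantile threshold `keep_quantile` are stand-in choices for whatever criteria VGRefine actually uses.

```python
import numpy as np

def attention_triage(head_attn, top_k=4, keep_quantile=0.8):
    """Step I: aggregate attention from the most visually sensitive heads
    and suppress low-confidence attention, yielding a binary mask.

    head_attn : array of shape (num_heads, H, W), per-head attention maps.
    """
    scores = head_attn.reshape(head_attn.shape[0], -1).sum(axis=1)
    top_heads = np.argsort(scores)[-top_k:]   # most "visually sensitive" heads
    agg = head_attn[top_heads].mean(axis=0)   # aggregate their attention maps
    thresh = np.quantile(agg, keep_quantile)  # suppress low-confidence mass
    return agg >= thresh                      # binary mask over patches

def attention_knockout(attn_map, mask, eps=1e-8):
    """Step II: knock out attention outside the mask and renormalize,
    refocusing the model on the triaged regions during inference."""
    refined = np.where(mask, attn_map, 0.0)
    return refined / (refined.sum() + eps)
```

In an actual MLLM, `head_attn` would come from the cross-modal attention over image tokens, and the refined distribution would replace the original one inside the forward pass; here both steps operate on plain arrays for clarity.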

Experimental Results

Our approach achieves SOTA performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples from 8 imaging modalities) without requiring additional training or external expert models.

Model            VQA-RAD  SLAKE  PathVQA  PMC-VQA  Avg.
Qwen-VL-Chat        47.0   56.0     55.1     36.6  48.9
LLaVA-v1.6-7B       52.6   57.9     47.9     35.5  48.5
Med-Flamingo        45.4   43.5     54.7     23.3  41.7
RadFM               50.6   34.6     38.7     25.9  37.5
LLaVA-Med-7B        51.4   48.6     56.8     24.7  45.4
LLaVA-Tri           59.8   43.4     59.0        -     -
HuatuoGPT-V-7B      67.4   76.5     60.7     53.9  65.3
VGRefine (Ours)     71.2   76.9     67.6     56.2  68.4
Experiments results figure

BibTeX

@inproceedings{liu2026how,
  title     = {How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images},
  author    = {Guimeng Liu and Tianze Yu and Somayeh Ebrahimkhani and Lin Zhi Zheng Shawn and Kok Pin Ng and Ngai-Man Cheung},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=dXshexyFKx}
}