ICLR 2026
Existing Med-VQA datasets are not well suited to systematically studying visual grounding (subfig. a): it is difficult to disentangle whether model errors arise from inadequate semantic grounding (i.e., the model cannot determine what to look for) or from inadequate visual grounding (i.e., the model cannot localize the relevant image region even when it knows what to look for). To separate these two failure modes, we design VGMED (subfig. b), a novel evaluation dataset developed with expert clinical guidance that explicitly assesses the visual grounding capability of medical MLLMs.
Medical MLLMs demonstrate suboptimal visual grounding on medical images. Analysis with our proposed VGMED dataset shows that all evaluated medical MLLMs exhibit substantially weaker alignment between their attention distributions and ground-truth annotations on medical images than on natural scene images (from MS COCO).
This failure mode contributes to their underperformance in zero-shot medical image understanding. A further comparison with LLaVA-v1.5 on natural images reinforces the observation: medical MLLMs show significantly lower alignment with annotated regions, highlighting deficiencies in visual grounding for medical image analysis.
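The attention–annotation alignment described above can be sketched as a simple overlap score. This is a minimal illustration, not the paper's exact metric: the function name, inputs (an attention heatmap and a binary expert mask), and the choice of "attention mass inside the annotated region" are all assumptions.

```python
import numpy as np

def attention_alignment(attention: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of total attention mass falling inside the annotated region.

    attention : non-negative 2D map (e.g. an averaged attention heatmap).
    mask      : binary 2D ground-truth annotation of the relevant region.
    """
    attention = attention / attention.sum()   # normalize to a distribution
    return float((attention * mask).sum())    # attention mass on the annotation

# Toy example: all attention on the annotated upper-left quadrant.
att = np.zeros((4, 4)); att[:2, :2] = 1.0
gt = np.zeros((4, 4)); gt[:2, :2] = 1.0
print(round(attention_alignment(att, gt), 2))  # 1.0 (perfect alignment)
```

A lower score would indicate the model attends outside the clinically relevant region, which is the failure mode the VGMED analysis quantifies.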
We propose Visual Grounding Refinement (VGRefine): a two-step inference-time method that improves visual grounding in medical MLLMs.
Our approach achieves state-of-the-art performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples spanning 8 imaging modalities) without requiring additional training or external expert models. Accuracy (%) on four representative benchmarks:
| Model | VQA-RAD | SLAKE | PathVQA | PMC-VQA | Avg. |
|---|---|---|---|---|---|
| Qwen-VL-Chat | 47.0 | 56.0 | 55.1 | 36.6 | 48.9 |
| LLaVA-v1.6-7B | 52.6 | 57.9 | 47.9 | 35.5 | 48.5 |
| Med-Flamingo | 45.4 | 43.5 | 54.7 | 23.3 | 41.7 |
| RadFM | 50.6 | 34.6 | 38.7 | 25.9 | 37.5 |
| LLaVA-Med-7B | 51.4 | 48.6 | 56.8 | 24.7 | 45.4 |
| LLaVA-Tri | 59.8 | 43.4 | 59.0 | - | - |
| HuatuoGPT-V-7B | 67.4 | 76.5 | 60.7 | 53.9 | 65.3 |
| VGRefine (Ours) | 71.2 | 76.9 | 67.6 | 56.2 | 68.4 |
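As a quick sanity check on the table, the macro-average over the four benchmarks shown can be recomputed directly. Note that the reported "Avg." column may be computed over more benchmarks than the four listed here (the text mentions six in total), so a small discrepancy with the printed value is expected.

```python
# Macro-average of VGRefine's scores over the four benchmarks shown above.
scores = {"VQA-RAD": 71.2, "SLAKE": 76.9, "PathVQA": 67.6, "PMC-VQA": 56.2}
avg = sum(scores.values()) / len(scores)
print(round(avg, 1))  # 68.0
```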
@inproceedings{liu2026how,
  title     = {How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images},
  author    = {Guimeng Liu and Tianze Yu and Somayeh Ebrahimkhani and Lin Zhi Zheng Shawn and Kok Pin Ng and Ngai-Man Cheung},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=dXshexyFKx}
}