GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI
Pith reviewed 2026-05-09 20:51 UTC · model grok-4.3
The pith
The GAZE framework lets medical VLMs call viewer and retrieval tools zero-shot, improving diagnosis and lesion localisation on rare brain MRI.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAZE is a grounded agentic zero-shot evaluation framework that equips vision-language models with viewer-level tools for image manipulation and literature retrieval tools, leading to 58.2 mAP at IoU 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy on the NOVA benchmark of 906 brain MRI cases spanning 281 rare conditions, all without task-specific fine-tuning. Structured prompting improves baselines, and tool usage benefits rare pathologies more than common ones, with retrieval showing model-dependent trade-offs between diagnosis and localisation.
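The localisation metric behind these numbers can be made concrete. Below is a minimal sketch of intersection-over-union and the IoU > 0.3 hit criterion that mAP@0.3 builds on; the box format and function names are illustrative, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hit_at_threshold(pred, gt, thr=0.3):
    """A predicted lesion box counts as a hit when IoU with ground truth exceeds thr."""
    return iou(pred, gt) > thr
```

mAP@0.3 then averages precision over cases using this hit criterion; the sketch only shows the per-box test.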
What carries the argument
The GAZE framework's integration of viewer-level tools (zoom, windowing, contrast, edge detection) and retrieval tools (PubMed for literature, Open-i for images) with structured, schema-validated outputs and full audit traces.
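The workflow described above (call a viewer or retrieval tool, observe the result, eventually emit a schema-validated answer with a recorded trace) can be sketched as follows. All names, signatures, and the schema here are hypothetical placeholders, not GAZE's actual API.

```python
# Hypothetical tool registry mirroring the viewer-level and retrieval tools the
# review describes; the names and signatures are illustrative, not GAZE's own.
TOOLS = {
    "zoom": lambda image, region: f"zoomed({region})",
    "windowing": lambda image, center, width: f"windowed({center},{width})",
    "pubmed_search": lambda image, query: f"abstracts for '{query}'",
}

# A minimal output schema in the spirit of "schema-validated structured outputs".
REQUIRED_KEYS = {"caption": str, "diagnosis": str, "box": list}

def validate(output: dict) -> bool:
    """Reject any final answer that does not match the expected schema."""
    return all(isinstance(output.get(k), t) for k, t in REQUIRED_KEYS.items())

def run_agent(model_step, image, max_calls=12):
    """Iterate: the model either calls a tool or emits a final structured answer.
    Every tool call is appended to a trace, giving the auditability the review notes."""
    trace, observation = [], None
    for _ in range(max_calls):
        action = model_step(image, observation, trace)
        if action["type"] == "tool":
            fn = TOOLS[action["name"]]
            observation = fn(image, *action.get("args", []))
            trace.append({"tool": action["name"], "args": action.get("args", [])})
        else:  # final structured answer
            answer = action["output"]
            return (answer, trace) if validate(answer) else (None, trace)
    return None, trace
```

The key design point is that validation failures and the full call history are both surfaced, so a downstream auditor can replay exactly which manipulations preceded each diagnosis.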
If this is right
- Tool engagement correlates with larger performance gains on rare conditions, raising the fraction of cases with IoU > 0.3 from 17% to 58% for diagnoses with three or fewer examples.
- Retrieval ablations show that diagnosis improvements can come at the cost of localization accuracy depending on the underlying model.
- Before tool use, schema-validated structured prompting already lifts performance from 20.2 to 29.4 mAP@0.3 over a Gemini baseline.
- Joint scoring of captioning, diagnosis, and localisation is necessary to capture the full effect of agentic workflows in medical VLMs.
Where Pith is reading between the lines
- Similar agentic tool use might generalize to other medical imaging domains like CT or X-ray for rare conditions.
- The recorded tool-call traces could enable post-hoc analysis to identify which tools contribute most to specific diagnostic improvements.
- By reducing the need for fine-tuning, this approach could accelerate deployment of diagnostic aids in low-resource settings with limited training data.
Load-bearing premise
The assumption that the introduced viewer tools and literature retrieval do not introduce new biases or artifacts, and that the NOVA benchmark fairly represents real-world diagnostic challenges for rare conditions.
What would settle it
An experiment comparing GAZE outputs to expert radiologist reports on the same NOVA cases, or showing that performance gains disappear when tool access is removed or when using a different set of rare cases not in the benchmark.
read the original abstract
Vision-language models (VLMs) read an image and produce text in a single forward pass, whereas radiologists typically inspect an image several times and consult the literature before writing a report. We introduce GAZE (Grounded Agentic Zero-shot Evaluation), a framework that lets a medical VLM work in this iterative way by calling viewer-level tools (zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine (PubMed for medical literature, Open-i for radiological images), with structured outputs validated against a schema and full tool-call traces recorded for auditability. On NOVA, a benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE reaches 58.2 mean average precision (mAP) at intersection-over-union (IoU) 0.3 for lesion localisation and 34.9% Top-1 diagnostic accuracy under a joint protocol that scores captioning, diagnosis, and localisation from the image alone, without task-specific fine-tuning. Before any tool is used, structured prompting and schema-validated outputs already improve over the published Gemini 2.0 Flash baseline (20.2 to 29.4 mAP@0.3), so framework design is itself an experimental variable. Tool use helps rare pathologies disproportionately: the fraction of cases with IoU > 0.3 rises from 17% to 58% for diagnoses with three or fewer examples versus 25% to 68% for common conditions (≥10 cases), with gains tracking engagement (Gemini 3 Flash: Cohen's d = 0.79, 11.8 tool calls per case; Gemini 2.0 Flash: tools used in 8.2% of cases, no significant benefit). Retrieval ablations additionally reveal a model-dependent trade-off in which gains in diagnosis can coincide with losses in localisation, reinforcing the case for joint evaluation of diagnosis, localisation, and captioning in medical VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the GAZE framework, which enables vision-language models to perform iterative, agentic zero-shot evaluation on brain MRI by calling viewer-level tools (zoom, windowing, contrast, edge detection) and retrieval tools (PubMed literature and Open-i images), with schema-validated structured outputs and full tool traces for auditability. On the NOVA benchmark of 906 cases spanning 281 rare neurological conditions, GAZE reports 58.2 mAP at IoU 0.3 for lesion localization and 34.9% Top-1 diagnostic accuracy under a joint protocol scoring captioning, diagnosis, and localization from the image alone without task-specific fine-tuning. The work includes before/after comparisons showing structured prompting improves over the Gemini 2.0 Flash baseline (20.2 to 29.4 mAP@0.3), further gains from tool use (especially for rare cases with ≤3 examples), model-dependent retrieval ablations, and quantitative tool-engagement metrics such as Cohen's d and call counts.
Significance. If the results hold, this is a meaningful contribution to medical AI and agentic VLMs because it supplies concrete evidence that grounded tool use and literature retrieval can disproportionately aid rare-condition performance, while advocating joint evaluation of diagnosis, localization, and captioning. Strengths include the external NOVA benchmark, published baselines, ablation studies that surface trade-offs, and auditability features (schema validation, traces) that support reproducibility and falsifiability of the reported 58.2 mAP and 34.9% accuracy figures.
minor comments (3)
- Abstract and Methods: the exact prompting template used for the structured-prompting baseline (the 20.2 to 29.4 mAP improvement) and the precise schema definition for output validation are not detailed; adding these would improve replicability of the framework-design result.
- Results: the rare/common split (≤3 vs ≥10 cases) and the joint scoring protocol thresholds are defined, but a supplementary table confirming these splits are applied uniformly across all reported statistics (including ablations) would strengthen the disproportionate-benefit claim for rare pathologies.
- Results: while Cohen's d and call-count statistics are mentioned, a compact table or figure summarizing per-model tool engagement and the diagnosis-localization trade-off across retrieval ablations would enhance clarity without altering the central claims.
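For readers checking the engagement statistic cited in the abstract, Cohen's d with a pooled standard deviation is straightforward to compute. This sketch assumes two per-case score samples (e.g. with versus without tool use); the paper may pool or pair its samples differently.

```python
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation: the standardized
    difference between two group means."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5
```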
Simulated Author's Rebuttal
We thank the referee for their positive summary of the GAZE framework, recognition of its significance for agentic VLMs in rare-disease medical imaging, and recommendation for minor revision. The report accurately reflects our contributions on the NOVA benchmark, tool-use ablations, and joint evaluation protocol.
Circularity Check
No significant circularity identified
full rationale
The paper presents GAZE as a new agentic framework for VLMs, with performance measured directly on the external NOVA benchmark of 906 cases across 281 rare conditions. Results (58.2 mAP@0.3, 34.9% Top-1 accuracy) are obtained via explicit comparisons to published baselines (Gemini 2.0 Flash at 20.2 mAP@0.3), ablations on tool use/retrieval, and quantitative tracking of engagement metrics, all defined independently of any fitted parameters or self-referential definitions. No equations reduce outputs to inputs by construction, no self-citations bear the central claims, and the joint evaluation protocol uses concrete thresholds and external standards without renaming or smuggling ansatzes. The derivation chain consists of empirical measurements against fixed benchmarks and baselines, remaining self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The NOVA benchmark of 906 cases accurately represents 281 rare neurological conditions without selection bias.
- ad hoc to paper: Viewer-level tools (zoom, windowing, contrast, edge detection) and retrieval tools return useful, non-misleading information when called.
invented entities (1)
- GAZE framework (no independent evidence)
Reference graph
Works this paper leans on
- [1] Bai, S., Chen, K., Liu, X., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025). https://doi.org/10.48550/arXiv.2502.13923
- [2] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)
- [3] Bercea, C.I., Li, J., Raffler, P.: NOVA: A benchmark for anomaly localization and clinical reasoning in brain MRI. arXiv preprint arXiv:2505.14064 (2025). https://doi.org/10.48550/arXiv.2505.14064
- [4] Bluethgen, C., Van Veen, D., Truhn, D., et al.: Agentic systems in radiology: Design, applications, evaluation, and challenges. arXiv preprint arXiv:2510.09404 (2025). https://doi.org/10.48550/arXiv.2510.09404
- [5] Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2016). https://doi.org/10.1093/jamia/ocv080
- [6] Drew, T., Evans, K., Võ, M.L.H., Jacobson, F.L., Wolfe, J.M.: Informatics in radiology: What can you see in a single glance and how might this guide visual search in medical images? RadioGraphics 33(1), 263–274 (2013). https://doi.org/10.1148/rg.331125023
- [7] European Parliament and Council of the European Union: Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (AI Act) (2024)
- [8] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
- [9] Fallahpour, A., Ma, J., Munim, A.: MedRAX: Medical reasoning agent for chest x-ray. arXiv preprint arXiv:2502.02673 (2025). https://doi.org/10.48550/arXiv.2502.02673
- [10] Fan, Z., Liang, C., Wu, C.: ChestX-Reasoner: Advancing radiology foundation models with reasoning through step-by-step verification. arXiv preprint arXiv:2504.20930 (2025). https://doi.org/10.48550/arXiv.2504.20930
- [11] Fathi, N., Kumar, A., Arbel, T.: AURA: A multi-modal medical agent for understanding, reasoning & annotation. arXiv preprint arXiv:2507.16940 (2025). https://doi.org/10.48550/arXiv.2507.16940
- [12] Graber, M.L., Franklin, N., Gordon, R.: Diagnostic error in internal medicine. Archives of Internal Medicine 165(13), 1493–1499 (2005). https://doi.org/10.1001/archinte.165.13.1493
- [13] Hoopes, A., Dey, N., Butoi, V.I.: VoxelPrompt: A vision agent for end-to-end medical image analysis. arXiv preprint arXiv:2410.08397 (2024). https://doi.org/10.48550/arXiv.2410.08397
- [14] Karpukhin, V., Oguz, B., Min, S., et al.: Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. pp. 6769–6781 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.550
- [15] Kim, H.Y., Li, J., Solana, A.B., et al.: Learning to reason about rare diseases through retrieval-augmented agents. arXiv preprint arXiv:2511.04720 (2025). https://doi.org/10.48550/arXiv.2511.04720
- [16] Mao, Y., Xu, W., Qin, Y., Gao, Y.: CT-Agent: A multimodal-LLM agent for 3D CT radiology question answering. arXiv preprint arXiv:2505.16229 (2025). https://doi.org/10.48550/arXiv.2505.16229
- [17] Medicines and Healthcare products Regulatory Agency: AI Airlock sandbox pilot programme report. GOV.UK report (2025), https://www.gov.uk/government/publications/ai-airlock-sandbox-pilot-programme-report, published 16 October 2025
- [18] Qwen Team: Qwen3.5: Towards native multimodal agents. Qwen blog (2026), https://qwen.ai/blog?id=qwen3.5, accessed April 2026
- [19] Saab, K., et al.: Capabilities of Gemini models in medicine. arXiv preprint arXiv:2404.18416 (2024). https://doi.org/10.48550/arXiv.2404.18416
- [20] Sayers, E.W., Bolton, E.E., Brister, J.R., et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 50(D1), D20–D26 (2022). https://doi.org/10.1093/nar/gkab1112
- [21] Sharma, N., Agrawal, P., Banerjee, P.: CXR-Agent: Vision-language models for chest x-ray interpretation with uncertainty aware radiology reporting. arXiv preprint arXiv:2407.08811 (2024). https://doi.org/10.48550/arXiv.2407.08811
- [22] Shen, W., Hu, Y., Liu, C., et al.: MedOpenClaw: Auditable medical imaging agents reasoning over uncurated full studies. arXiv preprint arXiv:2603.24649 (2026). https://doi.org/10.48550/arXiv.2603.24649
- [23] Xiong, L., Xiong, C., Li, Y.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: International Conference on Learning Representations (2021)
- [24] Yang, L., Xu, S., Sellergren, A., et al.: Advancing multimodal medical capabilities of Gemini. arXiv preprint arXiv:2405.03162 (2024). https://doi.org/10.48550/arXiv.2405.03162
- [25] Yu, Y., Huang, Z., Mu, L.: Radiologist copilot: An agentic framework orchestrating specialized tools for reliable radiology reporting. arXiv preprint arXiv:2512.02814 (2025). https://doi.org/10.48550/arXiv.2512.02814