pith. machine review for the scientific record.

arxiv: 2605.11224 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 3 theorem links

· Lean Theorem

ABRA: Agent Benchmark for Radiology Applications

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords radiology agents · AI benchmarks · medical imaging · DICOM viewer · perception bottleneck · tool calling · agent evaluation

The pith

Current AI agents reach at least 89% execution success on radiology annotation tasks but only 0-25% outcome success; supplying oracle perception inputs lifts outcomes to 69-100%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ABRA is a benchmark that places agents inside a real DICOM viewer and server, requiring them to use 21 tools for slice navigation, windowing, annotation, and structured reporting instead of receiving pre-selected images. Ten models were tested across 655 tasks drawn from the LIDC-IDRI, Duke Breast Cancer MRI, and NLST datasets. The models handled tool execution reliably yet produced correct medical outcomes in only a small fraction of cases. Supplying the findings of a simulated perfect detector as input raised outcome scores sharply, isolating the failure to visual perception rather than planning or tool use.
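
The agent never receives pre-cropped images; it must drive the viewer through function calls. As a rough illustration of how such an interface is typically exposed to a tool-calling model, here is a minimal sketch of OpenAI-style tool schemas; the tool names, descriptions, and parameters are assumptions made for this example, not ABRA's actual 21-tool API.

```python
# Minimal sketch of an OpenAI-style function-calling tool list for a DICOM
# viewer agent. Tool names, descriptions, and parameters are illustrative
# assumptions, not the released ABRA interface.

VIEWER_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "navigate_to_slice",   # hypothetical tool name
            "description": "Scroll the active series to a given slice index.",
            "parameters": {
                "type": "object",
                "properties": {"slice_index": {"type": "integer", "minimum": 0}},
                "required": ["slice_index"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "set_window",          # hypothetical tool name
            "description": "Apply a window width/level to the active viewport.",
            "parameters": {
                "type": "object",
                "properties": {
                    "window_width": {"type": "number"},
                    "window_level": {"type": "number"},
                },
                "required": ["window_width", "window_level"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "add_annotation",      # hypothetical tool name
            "description": "Place a bounding box, in pixel coordinates, on the current slice.",
            "parameters": {
                "type": "object",
                "properties": {
                    "x": {"type": "number"},
                    "y": {"type": "number"},
                    "width": {"type": "number"},
                    "height": {"type": "number"},
                },
                "required": ["x", "y", "width", "height"],
            },
        },
    },
]
```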

Core claim

In the ABRA benchmark, agents operate an OHIF viewer and Orthanc DICOM server through twenty-one function-calling tools spanning navigation, metadata queries, pixel annotation, and BI-RADS reporting. Ten evaluated models achieve at least 89% execution on real annotation tasks but only 0-25% outcome success; the paired oracle variant, which supplies findings from a simulated detector, raises outcome success to 69-100% across the same models and tasks.
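
To make the real-versus-oracle contrast concrete, the sketch below shows one way a paired task could be represented: the same annotation goal, with the oracle variant injecting a simulated detector's finding into the prompt so that only planning and tool use are exercised. The field names, prompt wording, and the specific detector finding are illustrative assumptions, not ABRA's released task format.

```python
# Illustrative sketch of a paired real / oracle annotation task.
# Field names, prompt text, and the detector finding are invented for this
# example; they are not drawn from the released ABRA task generators.

real_task = {
    "type": "annotation",
    "dataset": "LIDC-IDRI",
    "prompt": (
        "Locate the pulmonary nodule in the loaded CT series and draw a "
        "bounding box around it on the correct slice."
    ),
}

# The paired oracle variant asks for the same action, but the finding is
# supplied up front, so perception is no longer required.
oracle_task = {
    **real_task,
    "type": "annotation_oracle",
    "prompt": real_task["prompt"] + (
        " A detector reports the nodule on slice 42, centred near pixel "
        "(312, 198), roughly 14 px in diameter."
    ),
}
```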

What carries the argument

The ABRA environment and its twenty-one tools for viewer control, annotation, and reporting, evaluated through automatic scorers on Planning, Execution, and Outcome dimensions.
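
The paper's scorers are task-type-specific and follow Bluethgen et al. (2025); the toy sketch below only illustrates the three-dimension idea for a single annotation episode. The tool-name matching used for Planning and the 0.5 IoU threshold used for Outcome are assumptions of this example, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCall:
    name: str
    ok: bool      # did the viewer/server call return without error?

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two pixel-space boxes."""
    x1, y1 = max(a.x, b.x), max(a.y, b.y)
    x2, y2 = min(a.x + a.w, b.x + b.w), min(a.y + a.h, b.y + b.h)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0

def score_annotation_episode(
    calls: list[ToolCall],
    predicted: Optional[Box],
    truth: Box,
    required: tuple[str, ...] = ("navigate", "annotation"),
) -> dict[str, float]:
    """Toy Planning / Execution / Outcome scores for one annotation episode."""
    # Planning: did the trajectory include every required tool category?
    planning = float(all(any(req in c.name for c in calls) for req in required))
    # Execution: fraction of issued tool calls that succeeded.
    execution = sum(c.ok for c in calls) / len(calls) if calls else 0.0
    # Outcome: is the final annotation close enough to the reference lesion?
    outcome = float(predicted is not None and iou(predicted, truth) >= 0.5)
    return {"planning": planning, "execution": execution, "outcome": outcome}
```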

If this is right

  • Tool orchestration itself is not the primary obstacle for current radiology agents.
  • Perception accuracy directly determines whether an agent can complete diagnostic or reporting goals.
  • Interactive viewer benchmarks expose limitations hidden by static-image evaluation setups.
  • Medical agents will require tighter coupling between vision modules and action tools to reach usable performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Advances in medical vision models could translate quickly into functional agent systems once perception improves.
  • Similar bottlenecks may appear in other domains where agents must interpret complex visual data before acting.
  • Future benchmarks could add live human-in-the-loop scoring to validate the automatic metrics.

Load-bearing premise

The automatic scorers correctly measure whether an agent has solved the medical task, and the programmatically generated tasks faithfully capture the difficulties of actual radiology work.

What would settle it

Running the identical ABRA annotation and reporting tasks with expert radiologists in the live viewer and comparing their outcome scores with those of the models, or testing newer vision models to check whether outcome scores rise without oracle inputs.

Figures

Figures reproduced from arXiv: 2605.11224 by Alessandra Mileo, Bulat Maksudov, Kathleen M. Curran, Vladislav Kurenkov.

Figure 1. Overview of the ABRA architecture. The benchmark controller draws tasks from the suite, …
Figure 2. Task distribution by difficulty and type for ABRA.
Figure 3. Agent execution path inside OHIF. A user prompt (D) submitted via the embedded …
Figure 4. Agent system prompt template. {task_description} is the per-task string from …
Original abstract

Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces ABRA, a benchmark for radiology agents operating an OHIF viewer and Orthanc DICOM server through 21 function-calling tools for slice navigation, windowing, annotation, and structured reporting. It evaluates 10 models (5 closed, 5 open) on 655 programmatically generated tasks across 8 types and 3 difficulty tiers drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST. Results show ≥89% Execution but only 0-25% Outcome on real annotation/BI-RADS tasks, rising to 69-100% on oracle variants with simulated detectors, localizing the performance bottleneck to perception rather than tool use. Code, generators, and scorers are released.

Significance. If the automatic scorers are shown to be reliable, this work supplies a valuable interactive benchmark that moves beyond static-image medical-agent evaluations and supplies concrete evidence of a perception bottleneck in current VLMs. The public release of task generators and scorers is a clear strength that supports reproducibility and extension by the community.

major comments (2)
  1. §4 (Evaluation Protocol) and §4.3 (Automatic Scorers): The headline localization of the bottleneck to perception rests on the large Outcome gap (0-25% real vs. 69-100% oracle). The paper relies on task-type-specific automatic scorers (citing Bluethgen et al. 2025) but reports no inter-rater agreement study, human calibration set, or validation against expert radiologists. Because Execution is already ≥89%, any systematic bias in the Outcome scorer (e.g., penalizing semantically correct but stylistically variant reports) directly affects the central claim.
  2. §3.2 (Task Generation): The 655 tasks are programmatically derived from the cited public datasets, yet the manuscript provides no quantitative assessment of how faithfully the generated tasks capture the distribution of real radiology challenges (ambiguous findings, multi-lesion cases, or clinical workflow variability). This limits the strength of the generalization that the observed perception bottleneck applies to clinical practice.
minor comments (3)
  1. Table 2 (Model Performance): The per-model Outcome scores would benefit from confidence intervals or statistical tests to support the claim that the oracle improvement is consistent across all ten models; a bootstrap sketch follows this list.
  2. Figure 1 (Tool Taxonomy): The diagram of the 21 tools would be clearer if it explicitly grouped tools by category (navigation, annotation, reporting) and indicated which are used in the oracle variants.
  3. References: The reference list should include the full bibliographic details for Bluethgen et al. 2025 if it is not already present.
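
One way the per-model intervals requested in minor comment 1 could be produced is a percentile bootstrap over per-task binary Outcome scores. The sketch below is not from the paper and uses made-up numbers purely to show the computation.

```python
import random

def bootstrap_ci(outcomes: list[int], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean of per-task binary Outcome scores."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(outcomes[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(outcomes) / n, (lo, hi)

# Invented example: 12 Outcome successes out of 80 real annotation tasks.
mean, (lo, hi) = bootstrap_ci([1] * 12 + [0] * 68)
print(f"Outcome = {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```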

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript introducing ABRA. We appreciate the emphasis on the reliability of the automatic scorers and the fidelity of the task generation process. Below, we provide point-by-point responses to the major comments and indicate the revisions we will incorporate in the updated version.

read point-by-point responses
  1. Referee: §4 (Evaluation Protocol) and §4.3 (Automatic Scorers): The headline localization of the bottleneck to perception rests on the large Outcome gap (0-25% real vs. 69-100% oracle). The paper relies on task-type-specific automatic scorers (citing Bluethgen et al. 2025) but reports no inter-rater agreement study, human calibration set, or validation against expert radiologists. Because Execution is already ≥89%, any systematic bias in the Outcome scorer (e.g., penalizing semantically correct but stylistically variant reports) directly affects the central claim.

    Authors: We agree that additional validation of the automatic scorers would strengthen the central claim regarding the perception bottleneck. While the scorers follow the methodology validated in Bluethgen et al. (2025), the current manuscript does not present a dedicated inter-rater study. In the revised manuscript, we will add a new subsection under §4.3 that includes a human calibration study. Specifically, we will have two expert radiologists independently score a random sample of 100 tasks (50 annotation and 50 BI-RADS) and report Cohen's kappa agreement between human and automatic scores, as well as between the two radiologists; a minimal kappa computation is sketched after these responses. This will allow us to quantify any potential bias in the Outcome metric. revision: yes

  2. Referee: §3.2 (Task Generation): The 655 tasks are programmatically derived from the cited public datasets, yet the manuscript provides no quantitative assessment of how faithfully the generated tasks capture the distribution of real radiology challenges (ambiguous findings, multi-lesion cases, or clinical workflow variability). This limits the strength of the generalization that the observed perception bottleneck applies to clinical practice.

    Authors: We acknowledge the value of demonstrating how well the programmatically generated tasks reflect real-world radiology distributions. The tasks are derived directly from the annotations and metadata in LIDC-IDRI, Duke Breast Cancer MRI, and NLST, which are established clinical datasets. To address this, we will include in the revised §3.2 a quantitative comparison: for example, histograms of lesion diameters and counts per case in ABRA tasks versus the source datasets' reported statistics. We will also add a limitations paragraph noting that ABRA prioritizes tasks amenable to automatic scoring and does not fully capture highly ambiguous or workflow-variable cases, suggesting this as an avenue for future extensions. revision: yes
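
For the calibration study proposed in response 1, agreement between the automatic scorer and a radiologist on binary pass/fail Outcome labels could be summarized with Cohen's kappa. The sketch below uses invented labels purely to show the calculation; it is not data from the paper.

```python
def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two raters assigning binary pass/fail Outcome labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed agreement
    pa1 = sum(labels_a) / n                                    # rater A "pass" rate
    pb1 = sum(labels_b) / n                                    # rater B "pass" rate
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)                     # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Invented example: automatic scorer vs. one radiologist on 10 sampled tasks.
auto  = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
human = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
print(round(cohen_kappa(auto, human), 2))   # ~0.78 agreement beyond chance
```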

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct evaluations

Full rationale

The paper introduces a radiology agent benchmark and reports model performance via direct execution on programmatically generated tasks drawn from public datasets. Scoring relies on an external citation (Bluethgen et al. 2025) rather than self-citation or internal fitting. The real-vs-oracle comparison is a controlled experimental contrast, not a derivation that reduces to its own inputs by construction. No equations, parameter fitting, uniqueness theorems, or ansatzes appear; results are released artifacts evaluated on the defined tasks. This is a standard empirical benchmark study whose central claims rest on observable performance gaps rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper builds upon existing medical imaging datasets and adopts a scoring framework from prior work (Bluethgen et al., 2025). No free parameters or new entities are introduced; the novelty lies in the tool interface and task suite.

axioms (1)
  • Domain assumption: scoring along Planning, Execution, and Outcome, as per Bluethgen et al., 2025
    The paper adopts this prior scoring method for evaluating agent episodes.


discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

  1. Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu. 2024.
  2. Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges. 2025.
  3. Saha, Ashirbani; Harowicz, Michael R.; Grimm, Lars J.; Weng, Jingxi; Cain, E. H.; Kim, C. E.; Ghate, S. V.; Walsh, R.; Mazurowski, Maciej A. 2021. doi:10.7937/TCIA.E3SV-RE93.
  4. Gong, Andie; Daly, Morgan; Goldin, Jonathan; Brown, Matthew; McNitt-Gray, Michael; Ruchalski, Kathleen. 2025. doi:10.7937/EYVH-AG54.
  5. Data from …. 2013. doi:10.7937/TCIA.HMQ8-J677.
  6. Clark, Kenneth; Vendt, Bruce; Smith, Kirk; Freymann, John; Kirby, Justin; Koppel, Paul; Moore, Stephen; Phillips, Stanley; Maffitt, David; Pringle, Michael; Tarbox, Lawrence; Prior, Fred. Journal of Digital Imaging.
  7. Ziegler, Erik; Urban, Trinity; Brown, Danny; Petts, James; Pieper, Steve D.; Lewis, Rob; Hafey, Chris; Harris, Gordon J. Open Health Imaging Foundation Viewer: An extensible open-source framework for building Web-based imaging applications to support Cancer Research. JCO Clin. Cancer Inform.
  8. Jodogne, Sébastien. The Orthanc Ecosystem for Medical Imaging. Journal of Digital Imaging. 2018. doi:10.1007/s10278-018-0082-y.
  9. Armato, Samuel G., 3rd; McLennan, Geoffrey; Bidaut, Luc; McNitt-Gray, Michael F.; Meyer, Charles R.; Reeves, Anthony P.; Zhao, Binsheng; Aberle, Denise R.; Henschke, Claudia I.; Hoffman, Eric A.; Kazerooni, Ella A.; MacMahon, Heber; Van Beeke, Edwin J. R.; Yankelevitz, David; Biancardi, Alberto M.; Bland, Peyton H.; et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans.
  10. WebArena: A Realistic Web Environment for Building Autonomous Agents. The Twelfth International Conference on Learning Representations.
  11. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. The Thirteenth International Conference on Learning Representations.
  12. VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents. The Thirteenth International Conference on Learning Representations.
  13. AgentStudio: A Toolkit for Building General Virtual Agents. The Thirteenth International Conference on Learning Representations.
  14. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik R Narasimhan. 2024.
  15. John Yang, Carlos E Jimenez, Alex L Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R Narasimhan, Diyi Yang, Sida Wang, Ofir Press. 2025.
  16. Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, Lilian Weng. 2025.
  17. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. The Twelfth International Conference on Learning Representations.
  18. Commit0: Library Generation from Scratch. The Thirteenth International Conference on Learning Representations.
  19. Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang. AgentBench: Evaluating …
  20. Gr…. The Twelfth International Conference on Learning Representations.
  21. Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu. 2025.
  22. DiscoveryWorld: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents. The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  23. Drouin, Alexandre; Gasse, Maxime; Caccia, Massimo; Laradji, Issam H.; Del Verme, Manuel; Marty, Tom; Vazquez, David; Chapados, Nicolas; Lacoste, Alexandre. 2024.
  24. Lu, Xing Han; Kasner, Zden…. Proceedings of the 41st International Conference on Machine Learning. 2024.
  25. Huang, Qian; Vora, Jian; Liang, Percy; Leskovec, Jure. 2024.
  26. Pan, Jiayi; Wang, Xingyao; Neubig, Graham; Jaitly, Navdeep; Ji, Heng; Suhr, Alane; Zhang, Yizhe. Training Software Engineering Agents and Verifiers with …. 2025.
  27. Deng, Xiang; Gu, Yu; Zheng, Boyuan; Chen, Shijie; Stevens, Sam; Wang, Boshi; Sun, Huan; Su, Yu. Mind2Web: Towards a Generalist Agent for the Web.
  28. Yao, Shunyu; Chen, Howard; Yang, John; Narasimhan, Karthik. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents.
  29. Samuel Schmidgall, Rojin Ziaei, Carl William Harris, Ji Woong Kim, Eduardo Pontes Reis, Jeffrey K Jopling, Michael Moor. AgentClinic: a multimodal agent benchmark to evaluate …
  30. Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang Su, Kao-Jung Chang, Haoliang Li, Linlin Shen, Michael Lyu, Wenting Chen. MedChain: Bridging the Gap Between …. 2025.
  31. MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI. 2025. doi:10.1056/AIdbp2500144.
  32. An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine. 2025.
  33. Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning. The Fourteenth International Conference on Learning Representations.
  34. Sequential Diagnosis with Language Models. 2025.
  35. Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie. How well can modern LLMs act as agent cores in radiology environments? 2024. arXiv:2412.09529.
  36. MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  37. MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning. 2025.
  38. Kimi K2.5: Visual Agentic Intelligence. 2026.
  39. Ministral 3. 2026.
  40. OpenAI GPT-5 System Card. 2026.
  41. 2026.
  42. 2025.
  43. 2026, April.
  44. Introducing …. 2025, December.
  45. Qwen Team. Qwen3.5: Accelerating Productivity with Native Multimodal Agents.