pith. machine review for the scientific record.

arxiv: 2605.11438 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links

Beyond Masks: The Case for Medical Image Parsing

Alan L. Yuille, Siddharth Gupta, Zongwei Zhou

Pith reviewed 2026-05-13 01:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical image parsing · structured representation · segmentation masks · entities · attributes · relationships · medical imaging · radiology reports

The pith

Medical imaging should replace per-voxel masks with structured parses that name entities, attributes, and relationships together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

After a decade of progress on per-voxel masks that report size, volume, and location, the paper argues medical imaging should center on medical image parsing instead. This produces a single structured output in which named entities such as structures and findings, their attributes such as margin regularity or enhancement pattern, and their relationships such as spatial position or change since a prior scan are emitted together and kept consistent. A good parse must first decide the right content for the current image, then contain enough detail to reconstruct that image, and finally support prediction of how the patient state will evolve, with all quantitative measurements derived from this content rather than produced separately. An audit of eleven representative systems shows that entities are largely solved, but attributes, relationships, and closure (mutual consistency across the parse) remain missing. The result would align model outputs more closely with the content of actual radiologist reports.

Core claim

Medical imaging should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. A good parse satisfies three properties, in order: decision (names the right things in the current image), reconstruction (content is rich enough to regenerate that image), and prediction (content is rich enough to forecast patient evolution). Quantitative measurements are derived from this content; they are not predicted alongside it.

What carries the argument

Medical image parsing: a structured representation of entities (named structures and findings), attributes (descriptions such as margin regularity or severity), and relationships (connections such as relative position or temporal change), required to satisfy decision, reconstruction, and prediction, in that order.
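
The representation described above can be sketched as a small data model. Every class and field name below is an illustrative assumption for exposition, not the paper's schema, and the closure check mirrors the consistency requirement only in spirit:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str               # e.g. "left kidney", "hepatic lesion"
    present: bool           # entities may be asserted present or absent

@dataclass
class Attribute:
    entity: str             # name of the entity being described
    key: str                # e.g. "margin", "enhancement"
    value: str              # e.g. "irregular", "rim-enhancing"

@dataclass
class Relationship:
    subject: str            # entity name
    predicate: str          # e.g. "abuts", "larger_than_prior"
    object: str             # entity name or a prior-scan reference

@dataclass
class Parse:
    entities: list[Entity] = field(default_factory=list)
    attributes: list[Attribute] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def closure_violations(self) -> list[str]:
        """Closure sketch: attributes and relationship subjects must refer to
        a named entity. (Relationship objects may point at a prior scan, so
        they are not checked here.)"""
        names = {e.name for e in self.entities}
        bad = [a.entity for a in self.attributes if a.entity not in names]
        bad += [r.subject for r in self.relationships if r.subject not in names]
        return bad
```

A parse whose attributes or relationships mention an entity that was never named would fail this check, which is the kind of mutual consistency the paper says training signals should reward.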

If this is right

  • Quantitative measurements would be computed from the parse content rather than generated as separate predictions.
  • Training objectives would shift to reward mutual consistency among entities, attributes, and relationships.
  • Clinical reports could be produced directly from the parse instead of from masks.
  • Model outputs would explain image content rather than only measure it.
  • Existing mask-based clinical infrastructure would need replacement by parse-based systems.
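
The first bullet, measurements computed from parse content rather than predicted separately, can be sketched in a few lines. The dict layout, field names, and the idea of storing voxel geometry in the parse are hypothetical choices for illustration, not the paper's format:

```python
# Sketch: derive a quantitative measurement from content already in the
# parse, instead of emitting the number as a separate prediction.

def derive_volume_ml(parse: dict, entity: str) -> float:
    """Volume in ml, computed from geometry stored with the entity."""
    ent = next(e for e in parse["entities"] if e["name"] == entity)
    voxel_ml = parse["voxel_volume_mm3"] / 1000.0   # mm^3 -> ml
    return ent["voxel_count"] * voxel_ml

parse = {
    "voxel_volume_mm3": 1.0,
    "entities": [{"name": "hepatic lesion", "voxel_count": 4200}],
}
print(derive_volume_ml(parse, "hepatic lesion"))  # prints 4.2
```

Because the number is a deterministic function of the parse, it can never disagree with the parse content, which is exactly the consistency a separately predicted measurement cannot guarantee.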

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such parses could allow direct generation of natural-language reports that match radiologist style more closely.
  • New datasets would be required that explicitly label attributes and relationships in addition to entities.
  • Longitudinal tracking of disease could improve because relationships explicitly encode change over time.
  • Integration with electronic health records might become simpler if the output already contains the structured elements used in reports.

Load-bearing premise

That a structured parse meeting the three properties can be produced at scale and will prove clinically better than masks without creating new failure modes.

What would settle it

A controlled test in which a system emitting such a parse fails to produce more accurate clinical reports, image reconstructions, or patient predictions than current mask-based systems.

Figures

Figures reproduced from arXiv: 2605.11438 by Alan L. Yuille, Siddharth Gupta, Zongwei Zhou.

Figure 1. Medical image parsing extends segmentation by emitting a layered structured output.
Figure 2. Example false-positive organ mask. In public CT case s1223, the reference left-kidney …
Figure 3. Example medical image parse. Two entities are present …
Original abstract

Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that medical imaging research should shift its central output from per-voxel segmentation masks to structured medical image parses consisting of entities, attributes, relationships, and closure. These parses must satisfy three properties in order: decision (naming the right things), reconstruction (rich enough to regenerate the image), and prediction (rich enough to forecast evolution). An audit of eleven representative systems is presented, concluding that entities are largely solved but attributes, relationships, and closure remain near-empty. The path forward is a commitment to richer outputs and appropriate training signals.

Significance. If the proposed framework holds and the audit is substantiated, this work could redirect the field toward more clinically relevant, explainable outputs that align better with radiologist reports. It highlights a conceptual gap between current mask-based methods and the structured information needed for decision-making and prognosis. However, without concrete examples of such parses or methods to enforce the three properties, the significance depends on future validation that the approach avoids new failure modes and is superior to masks.

major comments (2)
  1. [Audit of Current Systems] The audit of eleven systems is central to the claim that the field is far from producing well-formed parses. However, the selection criteria for the systems, the detailed evaluation protocol for assessing decision, reconstruction, prediction, and closure, and the scoring details are not provided. This makes it impossible to verify the conclusion that 'None emits a well-formed parse' and that attributes, relationships, and closure are 'near-empty'. Provide these details in the audit section to support the evidence.
  2. [Definition of Parsing Properties] The reconstruction property requires the parse to be rich enough to regenerate the original image. This could make the parse informationally equivalent to a dense mask or require an accompanying generative model, eroding the claimed advantage over masks. No derivation or example is given showing how the three properties can be jointly enforced via training signals without introducing issues like annotation explosion or inconsistent closure. This is load-bearing for the central argument that parsing is both achievable and superior.
minor comments (2)
  1. [Abstract] The phrase 'the report a radiologist writes contains almost nothing a mask can express' is a strong claim; consider adding a brief example contrasting a typical mask output with what a parse would provide.
  2. [Introduction] The term 'closure' is used in the abstract and early text but its precise meaning (e.g., consistency across the parse) is clarified only later; define it upon first use for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify how the manuscript can better support its central claims. We address each major comment below.

Point-by-point responses
  1. Referee: [Audit of Current Systems] The audit of eleven systems is central to the claim that the field is far from producing well-formed parses. However, the selection criteria for the systems, the detailed evaluation protocol for assessing decision, reconstruction, prediction, and closure, and the scoring details are not provided. This makes it impossible to verify the conclusion that 'None emits a well-formed parse' and that attributes, relationships, and closure are 'near-empty'. Provide these details in the audit section to support the evidence.

    Authors: We agree that the audit requires greater transparency to allow independent verification. In the revised manuscript we will expand the audit section with the explicit selection criteria used to identify the eleven representative systems, the step-by-step evaluation protocol applied to each parsing property (decision, reconstruction, prediction, and closure), and the scoring rubric employed. These additions will directly support the reported conclusion that none of the systems produces a well-formed parse and that attributes, relationships, and closure remain near-empty. revision: yes

  2. Referee: [Definition of Parsing Properties] The reconstruction property requires the parse to be rich enough to regenerate the original image. This could make the parse informationally equivalent to a dense mask or require an accompanying generative model, eroding the claimed advantage over masks. No derivation or example is given showing how the three properties can be jointly enforced via training signals without introducing issues like annotation explosion or inconsistent closure. This is load-bearing for the central argument that parsing is both achievable and superior.

    Authors: The reconstruction property is defined at the semantic level: the parse must contain sufficient named entities, attributes, and relationships to permit regeneration of the image's clinically relevant structures, not pixel-wise equivalence to a dense mask. This preserves the interpretability and compactness advantages over masks. We acknowledge that the original manuscript provides neither a formal derivation nor concrete examples of joint enforcement. In the revision we will add an illustrative example of a parse satisfying the three properties in order and a brief discussion of possible training signals, while clarifying that full resolution of annotation and consistency challenges is left to future empirical work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; conceptual argument is self-contained

full rationale

The paper advances a position that medical imaging should shift from per-voxel masks to structured parsing satisfying decision, reconstruction, and prediction properties, supported by an audit of eleven external systems. No equations, fitted parameters, or derivations appear. The central claim does not reduce to self-definition, renamed empirical patterns, or load-bearing self-citations; the audit and three-property framework are presented as independent motivation rather than tautological restatement of inputs. This is the expected outcome for a non-mathematical advocacy paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; the paper introduces the three parsing properties and the audit conclusion but provides no formal axioms, free parameters, or invented entities with independent evidence.

pith-pipeline@v0.9.0 · 5581 in / 1051 out tokens · 38046 ms · 2026-05-13T01:37:19.340211+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
