pith. machine review for the scientific record.

arxiv: 2605.11438 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links

Beyond Masks: The Case for Medical Image Parsing

Alan L. Yuille, Siddharth Gupta, Zongwei Zhou

Pith reviewed 2026-05-13 01:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical image parsing · structured representation · segmentation masks · entities · attributes · relationships · medical imaging · radiology reports

The pith

Medical imaging should replace per-voxel masks with structured parses that name entities, attributes, and relationships together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

After a decade of progress on per-voxel masks that report size, volume, and location, the paper argues medical imaging should center on medical image parsing instead. This produces a single structured output in which named entities such as structures and findings, their attributes such as margin regularity or enhancement pattern, and their relationships such as spatial position or change since a prior scan are emitted together and kept consistent. A good parse must first decide the right content for the current image, then contain enough detail to reconstruct that image, and finally support prediction of how the patient state will evolve, with all quantitative measurements derived from this content rather than produced separately. An audit of eleven representative systems shows that entities are largely solved, but attributes, relationships, and closure (mutual consistency across the parse) remain missing. The result would align model outputs more closely with the content of actual radiologist reports.

Core claim

Medical imaging should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. A good parse satisfies three properties, in order: decision (names the right things in the current image), reconstruction (content is rich enough to regenerate that image), and prediction (content is rich enough to forecast patient evolution). Quantitative measurements are derived from this content; they are not predicted alongside it.

What carries the argument

Medical image parsing: a structured representation of entities (named structures and findings), attributes (descriptions such as margin regularity or severity), and relationships (connections such as relative position or temporal change), required to satisfy decision, reconstruction, and prediction, in that order.
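
The representation described above can be sketched as a small data model. Every class and field name below is an illustrative assumption for exposition, not the paper's schema, and the closure check mirrors the consistency requirement only in spirit:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str               # e.g. "left kidney", "hepatic lesion"
    present: bool           # entities may be asserted present or absent

@dataclass
class Attribute:
    entity: str             # name of the entity being described
    key: str                # e.g. "margin", "enhancement"
    value: str              # e.g. "irregular", "rim-enhancing"

@dataclass
class Relationship:
    subject: str            # entity name
    predicate: str          # e.g. "abuts", "larger_than_prior"
    object: str             # entity name or a prior-scan reference

@dataclass
class Parse:
    entities: list[Entity] = field(default_factory=list)
    attributes: list[Attribute] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def closure_violations(self) -> list[str]:
        """Closure sketch: attributes and relationship subjects must refer to
        a named entity. (Relationship objects may point at a prior scan, so
        they are not checked here.)"""
        names = {e.name for e in self.entities}
        bad = [a.entity for a in self.attributes if a.entity not in names]
        bad += [r.subject for r in self.relationships if r.subject not in names]
        return bad
```

A parse whose attributes or relationships mention an entity that was never named would fail this check, which is the kind of mutual consistency the paper says training signals should reward.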

If this is right

  • Quantitative measurements would be computed from the parse content rather than generated as separate predictions.
  • Training objectives would shift to reward mutual consistency among entities, attributes, and relationships.
  • Clinical reports could be produced directly from the parse instead of from masks.
  • Model outputs would explain image content rather than only measure it.
  • Existing mask-based clinical infrastructure would need replacement by parse-based systems.
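
The first bullet, measurements computed from parse content rather than predicted separately, can be sketched in a few lines. The dict layout, field names, and the idea of storing voxel geometry in the parse are hypothetical choices for illustration, not the paper's format:

```python
# Sketch: derive a quantitative measurement from content already in the
# parse, instead of emitting the number as a separate prediction.

def derive_volume_ml(parse: dict, entity: str) -> float:
    """Volume in ml, computed from geometry stored with the entity."""
    ent = next(e for e in parse["entities"] if e["name"] == entity)
    voxel_ml = parse["voxel_volume_mm3"] / 1000.0   # mm^3 -> ml
    return ent["voxel_count"] * voxel_ml

parse = {
    "voxel_volume_mm3": 1.0,
    "entities": [{"name": "hepatic lesion", "voxel_count": 4200}],
}
print(derive_volume_ml(parse, "hepatic lesion"))  # prints 4.2
```

Because the number is a deterministic function of the parse, it can never disagree with the parse content, which is exactly the consistency a separately predicted measurement cannot guarantee.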

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such parses could allow direct generation of natural-language reports that match radiologist style more closely.
  • New datasets would be required that explicitly label attributes and relationships in addition to entities.
  • Longitudinal tracking of disease could improve because relationships explicitly encode change over time.
  • Integration with electronic health records might become simpler if the output already contains the structured elements used in reports.

Load-bearing premise

That a structured parse meeting the three properties can be produced at scale and will prove clinically better than masks without creating new failure modes.

What would settle it

A controlled test in which a system emitting such a parse fails to produce more accurate clinical reports, image reconstructions, or patient predictions than current mask-based systems.

Figures

Figures reproduced from arXiv: 2605.11438 by Alan L. Yuille, Siddharth Gupta, Zongwei Zhou.

Figure 1. Medical image parsing extends segmentation by emitting a layered structured output.
Figure 2. Example false-positive organ mask. In public CT case s1223, the reference left-kidney …
Figure 3. Example medical image parse. Two entities are present …
Original abstract

Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that medical imaging research should shift its central output from per-voxel segmentation masks to structured medical image parses consisting of entities, attributes, relationships, and closure. These parses must satisfy three properties in order: decision (naming the right things), reconstruction (rich enough to regenerate the image), and prediction (rich enough to forecast evolution). An audit of eleven representative systems is presented, concluding that entities are largely solved but attributes, relationships, and closure remain near-empty. The path forward is a commitment to richer outputs and appropriate training signals.

Significance. If the proposed framework holds and the audit is substantiated, this work could redirect the field toward more clinically relevant, explainable outputs that align better with radiologist reports. It highlights a conceptual gap between current mask-based methods and the structured information needed for decision-making and prognosis. However, without concrete examples of such parses or methods to enforce the three properties, the significance depends on future validation that the approach avoids new failure modes and is superior to masks.

major comments (2)
  1. [Audit of Current Systems] The audit of eleven systems is central to the claim that the field is far from producing well-formed parses. However, the selection criteria for the systems, the detailed evaluation protocol for assessing decision, reconstruction, prediction, and closure, and the scoring details are not provided. This makes it impossible to verify the conclusion that 'None emits a well-formed parse' and that attributes, relationships, and closure are 'near-empty'. Provide these details in the audit section to support the evidence.
  2. [Definition of Parsing Properties] The reconstruction property requires the parse to be rich enough to regenerate the original image. This could make the parse informationally equivalent to a dense mask or require an accompanying generative model, eroding the claimed advantage over masks. No derivation or example is given showing how the three properties can be jointly enforced via training signals without introducing issues like annotation explosion or inconsistent closure. This is load-bearing for the central argument that parsing is both achievable and superior.
minor comments (2)
  1. [Abstract] The phrase 'the report a radiologist writes contains almost nothing a mask can express' is a strong claim; consider adding a brief example contrasting a typical mask output with what a parse would provide.
  2. [Introduction] The term 'closure' is used in the abstract and early text but its precise meaning (e.g., consistency across the parse) is clarified only later; define it upon first use for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify how the manuscript can better support its central claims. We address each major comment below.

Point-by-point responses
  1. Referee: [Audit of Current Systems] The audit of eleven systems is central to the claim that the field is far from producing well-formed parses. However, the selection criteria for the systems, the detailed evaluation protocol for assessing decision, reconstruction, prediction, and closure, and the scoring details are not provided. This makes it impossible to verify the conclusion that 'None emits a well-formed parse' and that attributes, relationships, and closure are 'near-empty'. Provide these details in the audit section to support the evidence.

    Authors: We agree that the audit requires greater transparency to allow independent verification. In the revised manuscript we will expand the audit section with the explicit selection criteria used to identify the eleven representative systems, the step-by-step evaluation protocol applied to each parsing property (decision, reconstruction, prediction, and closure), and the scoring rubric employed. These additions will directly support the reported conclusion that none of the systems produces a well-formed parse and that attributes, relationships, and closure remain near-empty. revision: yes

  2. Referee: [Definition of Parsing Properties] The reconstruction property requires the parse to be rich enough to regenerate the original image. This could make the parse informationally equivalent to a dense mask or require an accompanying generative model, eroding the claimed advantage over masks. No derivation or example is given showing how the three properties can be jointly enforced via training signals without introducing issues like annotation explosion or inconsistent closure. This is load-bearing for the central argument that parsing is both achievable and superior.

    Authors: The reconstruction property is defined at the semantic level: the parse must contain sufficient named entities, attributes, and relationships to permit regeneration of the image's clinically relevant structures, not pixel-wise equivalence to a dense mask. This preserves the interpretability and compactness advantages over masks. We acknowledge that the original manuscript provides neither a formal derivation nor concrete examples of joint enforcement. In the revision we will add an illustrative example of a parse satisfying the three properties in order and a brief discussion of possible training signals, while clarifying that full resolution of annotation and consistency challenges is left to future empirical work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; conceptual argument is self-contained

full rationale

The paper advances a position that medical imaging should shift from per-voxel masks to structured parsing satisfying decision, reconstruction, and prediction properties, supported by an audit of eleven external systems. No equations, fitted parameters, or derivations appear. The central claim does not reduce to self-definition, renamed empirical patterns, or load-bearing self-citations; the audit and three-property framework are presented as independent motivation rather than tautological restatement of inputs. This is the expected outcome for a non-mathematical advocacy paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; the paper introduces the three parsing properties and the audit conclusion but provides no formal axioms, free parameters, or invented entities with independent evidence.

pith-pipeline@v0.9.0 · 5581 in / 1051 out tokens · 38046 ms · 2026-05-13T01:37:19.340211+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
