Recognition: 2 theorem links
· Lean Theorem
Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
Pith reviewed 2026-05-16 21:00 UTC · model grok-4.3
The pith
A public pipeline and dataset for whole-slide pathology images trains a vision-language model that outperforms MedGemma on visual question answering tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that synthetic data generated by Polysome from the HISTAI dataset can be used to train ANTONI-α, a whole-slide vision-language model that outperforms the MedGemma baseline on visual question answering tasks involving tissue identification, neoplasm detection, and differential diagnosis.
What carries the argument
Polysome, the standardised tool for creating synthetic instruction-response pairs from whole-slide images and clinical metadata.
Load-bearing premise
Synthetic instruction-response pairs from Polysome are of high enough quality and clinical relevance for effective VLM training.
What would settle it
Running ANTONI-α on a new set of real clinical VQA examples from pathologists and finding no performance advantage over MedGemma would falsify the claim of effective generalization from synthetic data.
Figures
Original abstract
Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Polysome, a standardized tool for generating synthetic instruction-response pairs from whole-slide images (WSIs). It applies Polysome to the public HISTAI dataset to produce HISTAI-Instruct, a dataset of 24,259 slides and over 1.1 million pairs. This is used to train ANTONI-α, a vision-language model for WSI-level visual question answering (VQA). The central claim is that ANTONI-α outperforms MedGemma on tasks of tissue identification, neoplasm detection, and differential diagnosis, with additional comparisons of model variants trained on varying data volumes. All methods, data, and code are released publicly.
Significance. If the performance claims are substantiated with full evaluation details, the work is significant for computational pathology. It directly addresses the scarcity of publicly available WSI-report paired data by releasing an open pipeline (Polysome), a large-scale instruction-tuning dataset (HISTAI-Instruct), and trained models (ANTONI-α variants). This promotes reproducibility and lowers barriers to developing VLMs as pathologist co-pilots, with the public release of code and data representing a clear strength for the field.
major comments (2)
- [Abstract and Results] Abstract and Results section: the claim of outperformance over MedGemma on WSI-level VQA tasks lacks any reported metrics (e.g., accuracy, precision, or F1), statistical tests, data splits, or analysis of biases in the synthetic pairs. These details are load-bearing for the central empirical claim and must be provided to allow assessment of whether the gains are meaningful or artifactual.
- [Methods] Methods section on Polysome and HISTAI-Instruct generation: the synthetic instruction-response pairs may lack sufficient clinical fidelity if generated primarily from slide-level metadata or ungrounded LLM prompting rather than real report text or expert review. This assumption is critical for the generalization claim, as superficial patterns or hallucinations in the 1.1M pairs could inflate benchmark scores without transferring to authentic clinical VQA queries.
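The paired significance test the first comment asks for can be illustrated concretely. The sketch below applies an exact McNemar test to per-question correctness vectors for two models; the counts are invented for illustration and are not taken from the paper.

```python
from math import comb

def mcnemar_exact(model_a_correct, model_b_correct):
    """Exact (binomial) McNemar test on paired per-question outcomes.

    Each argument is a list of bools, one entry per VQA question,
    True if that model answered correctly. Returns (b, c, p) where
    b = A right & B wrong, c = A wrong & B right.
    """
    b = sum(a and not m for a, m in zip(model_a_correct, model_b_correct))
    c = sum((not a) and m for a, m in zip(model_a_correct, model_b_correct))
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # two-sided exact p-value under Binomial(n, 0.5) on the discordant pairs
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, p)

# Illustrative paired outcomes for 20 VQA questions (not the paper's data)
antoni = [True] * 14 + [False] * 6
medgemma = [True] * 8 + [False] * 12
b, c, p = mcnemar_exact(antoni, medgemma)
```

Because the test conditions only on discordant pairs, it is a natural fit for comparing two models answering the same questions, which is exactly the evaluation setting the report describes.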
minor comments (1)
- [Abstract] Abstract: the mention of 'multiple incarnations of ANTONI-α trained with different amounts of data' would benefit from a brief reference to the specific data volumes used and a pointer to the corresponding performance table or figure for improved readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for strengthening the empirical validation and methodological transparency of our work. We address each major comment below and have revised the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Abstract and Results] Abstract and Results section: the claim of outperformance over MedGemma on WSI-level VQA tasks lacks any reported metrics (e.g., accuracy, precision, or F1), statistical tests, data splits, or analysis of biases in the synthetic pairs. These details are load-bearing for the central empirical claim and must be provided to allow assessment of whether the gains are meaningful or artifactual.
Authors: We agree that the original submission omitted explicit quantitative metrics, statistical tests, and bias analysis, which are necessary to substantiate the central claims. In the revised manuscript, we have added a dedicated Results subsection with a table reporting accuracy, precision, recall, and F1 scores for ANTONI-α versus MedGemma on tissue identification, neoplasm detection, and differential diagnosis. We include the evaluation data splits (80/10/10 on HISTAI-Instruct), McNemar's tests for statistical significance, and a bias analysis comparing performance on synthetic pairs versus a small held-out set of real clinical queries. These additions directly address the concern and allow readers to evaluate whether the gains are meaningful. revision: yes
- Referee: [Methods] Methods section on Polysome and HISTAI-Instruct generation: the synthetic instruction-response pairs may lack sufficient clinical fidelity if generated primarily from slide-level metadata or ungrounded LLM prompting rather than real report text or expert review. This assumption is critical for the generalization claim, as superficial patterns or hallucinations in the 1.1M pairs could inflate benchmark scores without transferring to authentic clinical VQA queries.
Authors: We acknowledge the validity of this concern regarding clinical fidelity and potential hallucinations in the synthetic data. Polysome relies on slide-level metadata and annotations from the public HISTAI dataset with structured prompting for grounding, but we agree this falls short of using real report text or expert review. In the revision, we have expanded the Methods section with explicit prompt templates, examples of generated pairs, and a new quantitative analysis of question-type diversity and potential biases. We have also added a limitations paragraph discussing risks of reduced generalization to authentic clinical VQA and outline plans for future expert validation. This provides greater transparency without overclaiming fidelity. revision: partial
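To make the fidelity concern tangible, a metadata-grounded instruction pair of the kind a tool like Polysome might emit can be sketched as below. The field names and templates here are hypothetical illustrations, not Polysome's actual schema or prompts.

```python
# Hypothetical sketch of metadata-grounded instruction-pair generation.
# TEMPLATES, the metadata fields, and the wording are illustrative only,
# not Polysome's actual schema.
TEMPLATES = {
    "tissue_id": (
        "What tissue type is shown in this whole-slide image?",
        "The slide shows {tissue} tissue.",
    ),
    "neoplasm": (
        "Is a neoplasm present in this slide?",
        "{neoplasm_answer}",
    ),
}

def make_pairs(slide_meta):
    """Turn one slide's metadata record into instruction-response pairs."""
    pairs = []
    q, a = TEMPLATES["tissue_id"]
    pairs.append({
        "slide_id": slide_meta["slide_id"],
        "instruction": q,
        "response": a.format(tissue=slide_meta["tissue"]),
    })
    q, a = TEMPLATES["neoplasm"]
    answer = ("Yes, findings consistent with a neoplasm are present."
              if slide_meta["neoplasm"] else
              "No neoplasm is identified.")
    pairs.append({
        "slide_id": slide_meta["slide_id"],
        "instruction": q,
        "response": a.format(neoplasm_answer=answer),
    })
    return pairs

example = {"slide_id": "HISTAI-0001", "tissue": "skin", "neoplasm": True}
pairs = make_pairs(example)
```

The referee's worry is visible even in this toy version: every response is only as faithful as the metadata field it is templated from, so errors or coarseness in slide-level labels propagate directly into all derived pairs.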
Circularity Check
No significant circularity; claims rest on empirical external comparison
full rationale
The paper introduces Polysome as a tool to generate synthetic instruction-response pairs from the public HISTAI dataset, creates HISTAI-Instruct, trains ANTONI-α on it, and reports empirical outperformance versus the external baseline MedGemma on WSI-level VQA tasks. No derivations, equations, or fitted parameters are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central performance claims are falsifiable via held-out evaluation against an independent model and do not rely on definitional equivalence or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard i.i.d. assumptions for training/validation/test splits and generalization in supervised learning hold for the generated instruction data.
invented entities (3)
- Polysome: no independent evidence
- HISTAI-Instruct: no independent evidence
- ANTONI-α: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
We introduce Polysome, a standardised tool for synthetic instruction generation... generating HISTAI-Instruct... train ANTONI-α, a VLM capable of visual-question answering (VQA).
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
We use HISTAI-Instruct to train ANTONI-α... outperforms MedGemma on WSI-level VQA tasks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Chengkuan Chen, Luca L. Weishaupt, Drew F. K. Williamson, Richard J. Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, and Faisal Mahmood. Evidence-based diagnostic reasoning with multi-agent copilot for human pathology, 2025. URL https://arxiv.org/abs/2506.20964
-
[2]
Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Yang. Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images, 2024a. URL https://arxiv.org/abs/2311.16480
-
[3]
Slidechat: A large vision-language assistant for whole-slide pathology image understanding
Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. arXiv preprint arXiv:2410.11761, 2024b
-
[4]
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023. URL https://arxiv.org/abs/2305.14314
-
[5]
Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen. Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction, 2024. URL https://arxiv.org/abs/2403.05396
-
[6]
Guillaume Jaume, Paul Doucet, Andrew H. Song, Ming Y. Lu, Cristina Almagro-Perez, Sophia J. Wagner, Anurag J. Vaidya, Richard J. Chen, Drew F. K. Williamson, Ahrong Kim, and Faisal Mahmood. Hest-1k: A dataset for spatial transcriptomics and histology image analysis. In Advances in Neural Information Processing Systems, December 2024
-
[7]
1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset
Geert Litjens, Peter Bandi, Babak Ehteshami Bejnordi, Oscar Geessink, Maschenka Balkenhol, Peter Bult, Altuna Halilovic, Meyke Hermsen, Rob van de Loo, Rob Vogels, Quirine F Manson, Nikolas Stathonikos, Alexi Baidoshvili, Paul van Diest, Carla Wauters, Marcory van Dijk, and Jeroen van der Laak. 1399 H&E-stained sentinel lymph node sections of breast ca...
-
[8]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URL https://arxiv.org/abs/2304.08485
-
[9]
A multimodal generative ai copilot for human pathology
Ming Lu, Bowen Chen, Drew Williamson, Richard Chen, Melissa Zhao, Aaron Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, Amr Soliman, Chengkuan Chen, Tong Ding, Judy Wang, Georg Gerber, Ivy Liang, Long Le, Anil Parwani, Luca Weishaupt, and Faisal Mahmood. A multimodal generative ai copilot for human pathology. Nature, 634:466--473, June 2024...
-
[10]
On the importance of text preprocessing for multimodal representation learning and pathology report generation
Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P. J. Moonemans, Gerben E. Breimer, Willeke A. M. Blokx, and Mitko Veta. On the importance of text preprocessing for multimodal representation learning and pathology report generation, 2025. URL https://arxiv.org/abs/2502.19285
-
[11]
Ruben T. Lucassen, Sander P. J. Moonemans, Tijn van de Luijtgaarden, Gerben E. Breimer, Willeke A. M. Blokx, and Mitko Veta. Pathology report generation and multimodal representation learning for cutaneous melanocytic lesions. In James C. Gee, Daniel C. Alexander, Jaesung Hong, Juan Eugenio Iglesias, Carole H. Sudre, Archana Venkataraman, Polina Golland, ...
-
[12]
Histai: An open-source, large-scale whole slide image dataset for computational pathology, 2025
Dmitry Nechaev, Alexey Pchelnikov, and Ekaterina Ivanova. Histai: An open-source, large-scale whole slide image dataset for computational pathology, 2025. URL https://arxiv.org/abs/2505.12120
-
[13]
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br...
-
[14]
Prism: A multi-modal generative foundation model for slide-level histopathology
George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D Kunz, Juan A Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, et al. Prism: A multi-modal generative foundation model for slide-level histopathology. arXiv preprint arXiv:2405.10254, 2024
-
[15]
Manuel Tran, Paul Schmidle, Ruifeng Ray Guo, Sophia J. Wagner, Valentin Koch, Valerio Lupperger, Brenna Novotny, Dennis H. Murphree, Heather D. Hardway, Marina D'Amato, Judith Lefkes, Daan J. Geijs, Annette Feuchtinger, Alexander Böhner, Robert Kaczmarczyk, Tilo Biedermann, Avital L. Amir, Antien L. Mooyaart, Francesco Ciompi, Geert Litjens, Chen Wang, Nn...
-
[16]
Mart van Rijthoven, Witali Aswolinskiy, Leslie Tessier, Maschenka Balkenhol, Joep M. A. Bogaerts, Damien Drubay, Laura Comerma Blesa, Dieter Peeters, Elisabeth Specht Stovgaard, Anne-Vibeke Lænkholm, Harry Haynes, Ligia Craciun, Denis Larsimont, Mohamed T. Amgad, Lee AD Cooper, Cyril de Kock, Valerie Dechering, Johannes Lotz, Nick Weiss, Mieke van Bocksta...
-
[17]
Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, Ellen Yang, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan H. Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Mill...
-
[18]
Eugene Vorontsov, George Shaikovski, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H. Bernhard, Ran A. Godrich, Juan A. Retamero, Jinru Shia, Mithat Gonen, Martin R. Weiser, David S. Klimstra, Razik Yousfi, Nicolo Fusi, Thomas J. Fuchs, Kristen Severson, and Siqi Liu. Prism2: Unlocking multi-modal general pathology ai with ...
-
[19]
The Cancer Genome Atlas Pan-Cancer analysis project
John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, and Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113--1120, October 2013. doi:10.1038/ng.2764
-
[20]
A versatile pathology co-pilot via reasoning enhanced multimodal large language model, 2025
Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, and Hao Chen. A versatile pathology co-pilot via reasoning enhanced multimodal large language model, 2025. URL https://arxiv.org/abs/2507.17303
-
[21]
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models, 2022. URL https://arxiv.org/abs/2205.01917
-
[22]
Accelerating data processing and benchmarking of ai models for pathology
Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, and Faisal Mahmood. Accelerating data processing and benchmarking of ai models for pathology. arXiv preprint arXiv:2502.06750, 2025