pith. machine review for the scientific record.

arxiv: 2512.17326 · v2 · submitted 2025-12-19 · 💻 cs.CV

Recognition: 2 Lean theorem links

Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords pathology · vision-language models · whole-slide images · synthetic data generation · visual question answering · open dataset · medical AI

The pith

A public pipeline and dataset for whole-slide pathology images train a vision-language model that outperforms MedGemma on visual question answering tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop Polysome, an open tool that generates synthetic question-answer pairs from whole-slide images using existing public datasets. They apply it to the HISTAI collection to produce HISTAI-Instruct, a resource with 24,259 slides and more than 1.1 million pairs. Training ANTONI-α on this data yields a model that performs better than MedGemma at identifying tissues, detecting neoplasms, and making differential diagnoses from entire slides. Results improve when using larger portions of the synthetic data. Releasing all code, data, and models publicly aims to lower barriers for building reliable pathology AI assistants.
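
To make the generation step concrete: a pipeline of this kind can be read as template rendering over slide-level metadata. The sketch below is a hypothetical illustration, not Polysome itself; the SlideRecord fields, question templates, and sampling policy are all assumptions.

```python
# Hypothetical sketch of metadata-to-instruction templating, in the spirit of
# the pipeline described above. Field names, templates, and the SlideRecord
# type are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass
import random

@dataclass
class SlideRecord:
    slide_id: str
    organ: str          # e.g. "skin"
    diagnosis: str      # e.g. "dermatofibroma"
    has_neoplasm: bool

QA_TEMPLATES = [
    ("Which organ or tissue does this whole-slide image show?",
     lambda r: f"The slide shows {r.organ} tissue."),
    ("Is a neoplasm present on this slide?",
     lambda r: "Yes, a neoplasm is present." if r.has_neoplasm
               else "No neoplasm is identified."),
    ("What is the most likely diagnosis?",
     lambda r: f"The most likely diagnosis is {r.diagnosis}."),
]

def make_pairs(record: SlideRecord, n: int = 3) -> list[dict]:
    """Render up to n instruction-response pairs from one slide's metadata."""
    pairs = []
    for question, render in random.sample(QA_TEMPLATES, k=min(n, len(QA_TEMPLATES))):
        pairs.append({"slide_id": record.slide_id,
                      "instruction": question,
                      "response": render(record)})
    return pairs

print(make_pairs(SlideRecord("HISTAI-0001", "skin", "dermatofibroma", True)))
```

Scaled over 24,259 slides, even a handful of templates per slide plausibly yields the reported million-pair order of magnitude, which is also why the referee's fidelity concern below matters.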

Core claim

The central discovery is that synthetic data generated by Polysome from the HISTAI dataset can be used to train ANTONI-α, a whole-slide vision-language model that outperforms the MedGemma baseline on visual question answering tasks involving tissue identification, neoplasm detection, and differential diagnosis.

What carries the argument

Polysome, the standardised tool for creating synthetic instruction-response pairs from whole-slide images and clinical metadata.

Load-bearing premise

Synthetic instruction-response pairs from Polysome are of high enough quality and clinical relevance for effective VLM training.

What would settle it

Running ANTONI-α on a new set of real clinical VQA examples from pathologists and finding no performance advantage over MedGemma would falsify the claim of effective generalization from synthetic data.
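
As a hedged sketch of what that settling experiment would compute: run both models over the same pathologist-sourced cases and compare accuracy per question type (the Q1-Q3 scheme of the paper's evaluation). The model.answer interface and the case schema below are assumptions, not APIs from the paper's release.

```python
# Hypothetical harness for the falsification test described above: score both
# models on the same externally sourced VQA cases, grouped by question type.
from collections import defaultdict

def accuracy_by_question(cases, model) -> dict[str, float]:
    """cases: iterable of dicts with 'wsi', 'qtype' (Q1/Q2/Q3), 'question',
    and a pathologist-provided reference 'answer'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        prediction = model.answer(case["wsi"], case["question"])  # assumed API
        totals[case["qtype"]] += 1
        hits[case["qtype"]] += int(prediction == case["answer"])
    return {q: hits[q] / totals[q] for q in totals}

# The generalization claim survives only if ANTONI-α beats MedGemma here:
# acc_antoni = accuracy_by_question(external_cases, antoni_alpha)
# acc_medgemma = accuracy_by_question(external_cases, medgemma)
```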

Figures

Figures reproduced from arXiv: 2512.17326 by Carlijn Lems, Francesco Ciompi, Frédérique Meeuwsen, Geert Litjens, Jeroen van der Laak, Sander Moonemans, Sebastiaan Ram.

Figure 1
Figure 1. HISTAI data preprocessing pipeline. Blue: retained cases; orange: discarded cases. view at source ↗
Figure 2
Figure 2. Architecture of ANTONI-α. Image processing modules (blue) extract features via VIRCHOW and PRISM. These features are aligned with conversational data (green) via a Vision Projector. The MedGemma LLM (pink) generates responses using inputs from both modalities. Snowflake and flame icons denote frozen and trainable parameters, respectively, during the instruction-tuning stage. view at source ↗
Figure 3
Figure 3. Validation pipeline for comparing ANTONI-α and MedGemma. Both models process the same WSI. For MedGemma, the WSI is first downscaled and packed. The evaluation consists of three questions: Q1 targets organ or tissue identification, Q2 detects the presence of a neoplasm, and Q3 requires the most likely diagnosis selected from three candidate differentials. view at source ↗
Figure 4
Figure 4. Qualitative comparison of ANTONI-α and MedGemma on a dermatology case (dermatofibroma). ANTONI-α (left) processes the full-resolution WSI and synthesizes its findings into the correct diagnosis of dermatofibroma. In contrast, MedGemma (right) relies on a lower-resolution thumbnail. It is unable to assess margins or describe cell details, leading to an incorrect diagnosis. view at source ↗
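
The caption of Figure 2 describes the load-bearing alignment step: frozen pathology encoders produce a slide-level embedding that a trainable projector maps into the MedGemma token space. A minimal sketch of that interface follows, assuming made-up dimensions and a single linear projector; the paper's released code is the authoritative reference.

```python
# Minimal sketch of the slide-to-LLM alignment described in Figure 2.
# Dimensions, module names, and the single-linear projector are assumptions
# for illustration only.
import torch
import torch.nn as nn

SLIDE_DIM = 1280   # assumed width of the PRISM slide embedding
LLM_DIM = 3072     # assumed MedGemma hidden size

class VisionProjector(nn.Module):
    """Maps one frozen slide-level embedding to a short sequence of
    pseudo-tokens in the LLM's embedding space."""
    def __init__(self, slide_dim: int, llm_dim: int, n_tokens: int = 8):
        super().__init__()
        self.n_tokens, self.llm_dim = n_tokens, llm_dim
        self.proj = nn.Linear(slide_dim, llm_dim * n_tokens)

    def forward(self, slide_emb: torch.Tensor) -> torch.Tensor:
        # (batch, slide_dim) -> (batch, n_tokens, llm_dim)
        return self.proj(slide_emb).view(-1, self.n_tokens, self.llm_dim)

# Toy forward pass: pretend PRISM has already pooled the VIRCHOW tile features.
slide_emb = torch.randn(2, SLIDE_DIM)       # output of the frozen slide encoder
visual_tokens = VisionProjector(SLIDE_DIM, LLM_DIM)(slide_emb)
text_tokens = torch.randn(2, 16, LLM_DIM)   # embedded instruction text
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([2, 24, 3072])
```

Per the caption, which modules are frozen versus trainable differs between pretraining and instruction tuning; the linear projector here stands in for whatever alignment module the release actually uses.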
read the original abstract

Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Polysome, a standardized tool for generating synthetic instruction-response pairs from whole-slide images (WSIs). It applies Polysome to the public HISTAI dataset to produce HISTAI-Instruct, a dataset of 24,259 slides and over 1.1 million pairs. This is used to train ANTONI-α, a vision-language model for WSI-level visual question answering (VQA). The central claim is that ANTONI-α outperforms MedGemma on tasks of tissue identification, neoplasm detection, and differential diagnosis, with additional comparisons of model variants trained on varying data volumes. All methods, data, and code are released publicly.

Significance. If the performance claims are substantiated with full evaluation details, the work is significant for computational pathology. It directly addresses the scarcity of publicly available WSI-report paired data by releasing an open pipeline (Polysome), a large-scale instruction-tuning dataset (HISTAI-Instruct), and trained models (ANTONI-α variants). This promotes reproducibility and lowers barriers to developing VLMs as pathologist co-pilots, with the public release of code and data representing a clear strength for the field.

major comments (2)
  1. [Abstract and Results] The claim of outperformance over MedGemma on WSI-level VQA tasks lacks any reported metrics (e.g., accuracy, precision, or F1), statistical tests, data splits, or analysis of biases in the synthetic pairs. These details are load-bearing for the central empirical claim and must be provided to allow assessment of whether the gains are meaningful or artifactual.
  2. [Methods] In the section on Polysome and HISTAI-Instruct generation, the synthetic instruction-response pairs may lack sufficient clinical fidelity if generated primarily from slide-level metadata or ungrounded LLM prompting rather than real report text or expert review. This assumption is critical for the generalization claim, as superficial patterns or hallucinations in the 1.1M pairs could inflate benchmark scores without transferring to authentic clinical VQA queries.
minor comments (1)
  1. [Abstract] The mention of 'multiple incarnations of ANTONI-α trained with different amounts of data' would benefit from a brief reference to the specific data volumes used and a pointer to the corresponding performance table or figure for improved readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for strengthening the empirical validation and methodological transparency of our work. We address each major comment below and have revised the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract and Results] The claim of outperformance over MedGemma on WSI-level VQA tasks lacks any reported metrics (e.g., accuracy, precision, or F1), statistical tests, data splits, or analysis of biases in the synthetic pairs. These details are load-bearing for the central empirical claim and must be provided to allow assessment of whether the gains are meaningful or artifactual.

    Authors: We agree that the original submission omitted explicit quantitative metrics, statistical tests, and bias analysis, which are necessary to substantiate the central claims. In the revised manuscript, we have added a dedicated Results subsection with a table reporting accuracy, precision, recall, and F1 scores for ANTONI-α versus MedGemma on tissue identification, neoplasm detection, and differential diagnosis. We include the evaluation data splits (80/10/10 on HISTAI-Instruct), McNemar's tests for statistical significance, and a bias analysis comparing performance on synthetic pairs versus a small held-out set of real clinical queries. These additions directly address the concern and allow readers to evaluate whether the gains are meaningful. revision: yes

  2. Referee: [Methods] In the section on Polysome and HISTAI-Instruct generation, the synthetic instruction-response pairs may lack sufficient clinical fidelity if generated primarily from slide-level metadata or ungrounded LLM prompting rather than real report text or expert review. This assumption is critical for the generalization claim, as superficial patterns or hallucinations in the 1.1M pairs could inflate benchmark scores without transferring to authentic clinical VQA queries.

    Authors: We acknowledge the validity of this concern regarding clinical fidelity and potential hallucinations in the synthetic data. Polysome relies on slide-level metadata and annotations from the public HISTAI dataset with structured prompting for grounding, but we agree this falls short of using real report text or expert review. In the revision, we have expanded the Methods section with explicit prompt templates, examples of generated pairs, and a new quantitative analysis of question-type diversity and potential biases. We have also added a limitations paragraph discussing risks of reduced generalization to authentic clinical VQA and outline plans for future expert validation. This provides greater transparency without overclaiming fidelity. revision: partial
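
For readers unfamiliar with the test invoked in response 1 above: McNemar's test compares two models on the same cases by looking only at the discordant pairs. A minimal sketch using statsmodels, with made-up counts:

```python
# Sketch of the McNemar comparison the rebuttal refers to: a paired test on
# per-case correctness of the two models. The counts below are invented; only
# the shape of the computation is the point.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes over the same evaluation cases:
# rows = ANTONI-α correct/incorrect, cols = MedGemma correct/incorrect.
table = [[412, 95],   # both correct | only ANTONI-α correct
         [23, 70]]    # only MedGemma correct | both incorrect

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.2e}")
```

Only the off-diagonal cells (cases where exactly one model is correct) drive the p-value, which is why a paired test is more informative here than comparing raw accuracies.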

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical external comparison

full rationale

The paper introduces Polysome as a tool to generate synthetic instruction-response pairs from the public HISTAI dataset, creates HISTAI-Instruct, trains ANTONI-α on it, and reports empirical outperformance versus the external baseline MedGemma on WSI-level VQA tasks. No derivations, equations, or fitted parameters are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central performance claims are falsifiable via held-out evaluation against an independent model and do not rely on definitional equivalence or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The work relies on standard machine learning assumptions for training and evaluation. No free parameters are explicitly fitted in the abstract beyond typical training hyperparameters. New entities are the introduced tool, dataset, and model.

axioms (1)
  • standard math Standard i.i.d. assumptions for training/validation/test splits and generalization in supervised learning hold for the generated instruction data.
    Implicit in any empirical ML training and evaluation setup described.
invented entities (3)
  • Polysome no independent evidence
    purpose: Standardized tool for synthetic instruction generation from WSIs
    New tool introduced to address data scarcity for VLM training.
  • HISTAI-Instruct no independent evidence
    purpose: Large-scale whole-slide instruction tuning dataset
    Generated dataset spanning 24,259 slides and 1.1M pairs.
  • ANTONI-α no independent evidence
    purpose: Vision-language model for WSI VQA
    Trained model using the new dataset.

pith-pipeline@v0.9.0 · 5554 in / 1339 out tokens · 29133 ms · 2026-05-16T21:00:49.313541+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

  1. [1]

    Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

    Chengkuan Chen, Luca L. Weishaupt, Drew F. K. Williamson, Richard J. Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, and Faisal Mahmood. Evidence-based diagnostic reasoning with multi-agent copilot for human pathology, 2025. URL https://arxiv.org/abs/2506.20964

  2. [2]

    Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images

    Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Yang. Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images, 2024a. URL https://arxiv.org/abs/2311.16480

  3. [3]

    Slidechat: A large vision-language assistant for whole-slide pathology image understanding

    Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. arXiv preprint arXiv:2410.11761, 2024b

  4. [4]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023. URL https://arxiv.org/abs/2305.14314

  5. [5]

    Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction

    Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen. Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction, 2024. URL https://arxiv.org/abs/2403.05396

  6. [6]

    Hest-1k: A dataset for spatial transcriptomics and histology image analysis

    Guillaume Jaume, Paul Doucet, Andrew H. Song, Ming Y. Lu, Cristina Almagro-Perez, Sophia J. Wagner, Anurag J. Vaidya, Richard J. Chen, Drew F. K. Williamson, Ahrong Kim, and Faisal Mahmood. Hest-1k: A dataset for spatial transcriptomics and histology image analysis. In Advances in Neural Information Processing Systems, December 2024

  7. [7]

    1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset

    Geert Litjens, Peter Bandi, Babak Ehteshami Bejnordi, Oscar Geessink, Maschenka Balkenhol, Peter Bult, Altuna Halilovic, Meyke Hermsen, Rob van de Loo, Rob Vogels, Quirine F Manson, Nikolas Stathonikos, Alexi Baidoshvili, Paul van Diest, Carla Wauters, Marcory van Dijk, and Jeroen van der Laak. 1399 H&E-stained sentinel lymph node sections of breast ca...

  8. [8]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URL https://arxiv.org/abs/2304.08485

  9. [9]

    A multimodal generative ai copilot for human pathology

    Ming Lu, Bowen Chen, Drew Williamson, Richard Chen, Melissa Zhao, Aaron Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, Amr Soliman, Chengkuan Chen, Tong Ding, Judy Wang, Georg Gerber, Ivy Liang, Long Le, Anil Parwani, Luca Weishaupt, and Faisal Mahmood. A multimodal generative ai copilot for human pathology. Nature, 634:466–473, 2024...

  10. [10]

    On the importance of text preprocessing for multimodal representation learning and pathology report generation

    Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P. J. Moonemans, Gerben E. Breimer, Willeke A. M. Blokx, and Mitko Veta. On the importance of text preprocessing for multimodal representation learning and pathology report generation, 2025. URL https://arxiv.org/abs/2502.19285

  11. [11]

    Pathology report generation and multimodal representation learning for cutaneous melanocytic lesions

    Ruben T. Lucassen, Sander P. J. Moonemans, Tijn van de Luijtgaarden, Gerben E. Breimer, Willeke A. M. Blokx, and Mitko Veta. Pathology report generation and multimodal representation learning for cutaneous melanocytic lesions. In James C. Gee, Daniel C. Alexander, Jaesung Hong, Juan Eugenio Iglesias, Carole H. Sudre, Archana Venkataraman, Polina Golland, ...

  12. [12]

    Histai: An open-source, large-scale whole slide image dataset for computational pathology

    Dmitry Nechaev, Alexey Pchelnikov, and Ekaterina Ivanova. Histai: An open-source, large-scale whole slide image dataset for computational pathology, 2025. URL https://arxiv.org/abs/2505.12120

  13. [13]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br...

  14. [14]

    Prism: A multi-modal generative foundation model for slide-level histopathology

    George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D Kunz, Juan A Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, et al. Prism: A multi-modal generative foundation model for slide-level histopathology. arXiv preprint arXiv:2405.10254, 2024

  15. [15]

    Manuel Tran et al.

    Manuel Tran, Paul Schmidle, Ruifeng Ray Guo, Sophia J. Wagner, Valentin Koch, Valerio Lupperger, Brenna Novotny, Dennis H. Murphree, Heather D. Hardway, Marina D'Amato, Judith Lefkes, Daan J. Geijs, Annette Feuchtinger, Alexander Böhner, Robert Kaczmarczyk, Tilo Biedermann, Avital L. Amir, Antien L. Mooyaart, Francesco Ciompi, Geert Litjens, Chen Wang, Nn...

  16. [16]

    Mart van Rijthoven et al.

    Mart van Rijthoven, Witali Aswolinskiy, Leslie Tessier, Maschenka Balkenhol, Joep M. A. Bogaerts, Damien Drubay, Laura Comerma Blesa, Dieter Peeters, Elisabeth Specht Stovgaard, Anne-Vibeke Lænkholm, Harry Haynes, Ligia Craciun, Denis Larsimont, Mohamed T. Amgad, Lee AD Cooper, Cyril de Kock, Valerie Dechering, Johannes Lotz, Nick Weiss, Mieke van Bocksta...

  17. [17]

    Eugene Vorontsov et al.

    Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, Ellen Yang, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan H. Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Mill...

  18. [18]

    Prism2: Unlocking multi-modal general pathology AI

    Eugene Vorontsov, George Shaikovski, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H. Bernhard, Ran A. Godrich, Juan A. Retamero, Jinru Shia, Mithat Gonen, Martin R. Weiser, David S. Klimstra, Razik Yousfi, Nicolo Fusi, Thomas J. Fuchs, Kristen Severson, and Siqi Liu. Prism2: Unlocking multi-modal general pathology ai with ...

  19. [19]

    The Cancer Genome Atlas Pan-Cancer analysis project

    John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, and Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113–1120, October 2013. doi:10.1038/ng.2764

  20. [20]

    A versatile pathology co-pilot via reasoning enhanced multimodal large language model

    Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, and Hao Chen. A versatile pathology co-pilot via reasoning enhanced multimodal large language model, 2025. URL https://arxiv.org/abs/2507.17303

  21. [21]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models, 2022. URL https://arxiv.org/abs/2205.01917

  22. [22]

    Accelerating data processing and benchmarking of ai models for pathology

    Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, and Faisal Mahmood. Accelerating data processing and benchmarking of ai models for pathology. arXiv preprint arXiv:2502.06750, 2025