pith. machine review for the scientific record.

arxiv: 2512.17326 · v2 · submitted 2025-12-19 · 💻 cs.CV

Recognition: 2 Lean theorem links

Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords pathology · vision-language models · whole-slide images · synthetic data generation · visual question answering · open dataset · medical AI

The pith

A public pipeline and dataset for whole-slide pathology images train a vision-language model that outperforms MedGemma on visual question answering tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop Polysome, an open tool that generates synthetic question-answer pairs from whole-slide images using existing public datasets. They apply it to the HISTAI collection to produce HISTAI-Instruct, a resource with 24,259 slides and more than 1.1 million pairs. Training ANTONI-α on this data yields a model that performs better than MedGemma at identifying tissues, detecting neoplasms, and making differential diagnoses from entire slides. Results improve when using larger portions of the synthetic data. Releasing all code, data, and models publicly aims to lower barriers for building reliable pathology AI assistants.
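
To make the generation step concrete: a pipeline of this kind can be read as template rendering over slide-level metadata. The sketch below is a hypothetical illustration, not Polysome itself; the SlideRecord fields, question templates, and sampling policy are all assumptions.

```python
# Hypothetical sketch of metadata-to-instruction templating, in the spirit of
# the pipeline described above. Field names, templates, and the SlideRecord
# type are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass
import random

@dataclass
class SlideRecord:
    slide_id: str
    organ: str          # e.g. "skin"
    diagnosis: str      # e.g. "dermatofibroma"
    has_neoplasm: bool

QA_TEMPLATES = [
    ("Which organ or tissue does this whole-slide image show?",
     lambda r: f"The slide shows {r.organ} tissue."),
    ("Is a neoplasm present on this slide?",
     lambda r: "Yes, a neoplasm is present." if r.has_neoplasm
               else "No neoplasm is identified."),
    ("What is the most likely diagnosis?",
     lambda r: f"The most likely diagnosis is {r.diagnosis}."),
]

def make_pairs(record: SlideRecord, n: int = 3) -> list[dict]:
    """Render up to n instruction-response pairs from one slide's metadata."""
    pairs = []
    for question, render in random.sample(QA_TEMPLATES, k=min(n, len(QA_TEMPLATES))):
        pairs.append({"slide_id": record.slide_id,
                      "instruction": question,
                      "response": render(record)})
    return pairs

print(make_pairs(SlideRecord("HISTAI-0001", "skin", "dermatofibroma", True)))
```

Scaled over 24,259 slides, even a handful of templates per slide plausibly yields the reported million-pair order of magnitude, which is also why the referee's fidelity concern below matters.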

Core claim

The central discovery is that synthetic data generated by Polysome from the HISTAI dataset can be used to train ANTONI-α, a whole-slide vision-language model that outperforms the MedGemma baseline on visual question answering tasks involving tissue identification, neoplasm detection, and differential diagnosis.

What carries the argument

Polysome, the standardised tool for creating synthetic instruction-response pairs from whole-slide images and clinical metadata.

Load-bearing premise

Synthetic instruction-response pairs from Polysome are of high enough quality and clinical relevance for effective VLM training.

What would settle it

Running ANTONI-α on a new set of real clinical VQA examples from pathologists and finding no performance advantage over MedGemma would falsify the claim of effective generalization from synthetic data.
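
As a hedged sketch of what that settling experiment would compute: run both models over the same pathologist-sourced cases and compare accuracy per question type (the Q1-Q3 scheme of the paper's evaluation). The model.answer interface and the case schema below are assumptions, not APIs from the paper's release.

```python
# Hypothetical harness for the falsification test described above: score both
# models on the same externally sourced VQA cases, grouped by question type.
from collections import defaultdict

def accuracy_by_question(cases, model) -> dict[str, float]:
    """cases: iterable of dicts with 'wsi', 'qtype' (Q1/Q2/Q3), 'question',
    and a pathologist-provided reference 'answer'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        prediction = model.answer(case["wsi"], case["question"])  # assumed API
        totals[case["qtype"]] += 1
        hits[case["qtype"]] += int(prediction == case["answer"])
    return {q: hits[q] / totals[q] for q in totals}

# The generalization claim survives only if ANTONI-α beats MedGemma here:
# acc_antoni = accuracy_by_question(external_cases, antoni_alpha)
# acc_medgemma = accuracy_by_question(external_cases, medgemma)
```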

Figures

Figures reproduced from arXiv: 2512.17326 by Carlijn Lems, Francesco Ciompi, Frédérique Meeuwsen, Geert Litjens, Jeroen van der Laak, Sander Moonemans, Sebastiaan Ram.

Figure 1
Figure 1. HISTAI data preprocessing pipeline. Blue: retained cases; orange: discarded cases. view at source ↗
Figure 2
Figure 2. Architecture of ANTONI-α. Image processing modules (blue) extract features via VIRCHOW and PRISM. These features are aligned with conversational data (green) via a Vision Projector. The MedGemma LLM (pink) generates responses using inputs from both modalities. Snowflake and flame icons denote frozen and trainable parameters, respectively, during the instruction-tuning stage. view at source ↗
Figure 3
Figure 3. Validation pipeline for comparing ANTONI-α and MedGemma. Both models process the same WSI. For MedGemma, the WSI is first downscaled and packed. The evaluation consists of three questions: Q1 targets organ or tissue identification, Q2 detects the presence of a neoplasm, and Q3 requires the most likely diagnosis selected from three candidate differentials. view at source ↗
Figure 4
Figure 4. Qualitative comparison of ANTONI-α and MedGemma on a dermatology case (dermatofibroma). ANTONI-α (left) processes the full-resolution WSI and synthesizes its findings into the correct diagnosis of dermatofibroma. In contrast, MedGemma (right) relies on a lower-resolution thumbnail. It is unable to assess margins or describe cell details, leading to an incorrect diagnosis. view at source ↗
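
The caption of Figure 2 describes the load-bearing alignment step: frozen pathology encoders produce a slide-level embedding that a trainable projector maps into the MedGemma token space. A minimal sketch of that interface follows, assuming made-up dimensions and a single linear projector; the paper's released code is the authoritative reference.

```python
# Minimal sketch of the slide-to-LLM alignment described in Figure 2.
# Dimensions, module names, and the single-linear projector are assumptions
# for illustration only.
import torch
import torch.nn as nn

SLIDE_DIM = 1280   # assumed width of the PRISM slide embedding
LLM_DIM = 3072     # assumed MedGemma hidden size

class VisionProjector(nn.Module):
    """Maps one frozen slide-level embedding to a short sequence of
    pseudo-tokens in the LLM's embedding space."""
    def __init__(self, slide_dim: int, llm_dim: int, n_tokens: int = 8):
        super().__init__()
        self.n_tokens, self.llm_dim = n_tokens, llm_dim
        self.proj = nn.Linear(slide_dim, llm_dim * n_tokens)

    def forward(self, slide_emb: torch.Tensor) -> torch.Tensor:
        # (batch, slide_dim) -> (batch, n_tokens, llm_dim)
        return self.proj(slide_emb).view(-1, self.n_tokens, self.llm_dim)

# Toy forward pass: pretend PRISM has already pooled the VIRCHOW tile features.
slide_emb = torch.randn(2, SLIDE_DIM)       # output of the frozen slide encoder
visual_tokens = VisionProjector(SLIDE_DIM, LLM_DIM)(slide_emb)
text_tokens = torch.randn(2, 16, LLM_DIM)   # embedded instruction text
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([2, 24, 3072])
```

Per the caption, which modules are frozen versus trainable differs between pretraining and instruction tuning; the linear projector here stands in for whatever alignment module the release actually uses.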
read the original abstract

Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Polysome, a standardized tool for generating synthetic instruction-response pairs from whole-slide images (WSIs). It applies Polysome to the public HISTAI dataset to produce HISTAI-Instruct, a dataset of 24,259 slides and over 1.1 million pairs. This is used to train ANTONI-α, a vision-language model for WSI-level visual question answering (VQA). The central claim is that ANTONI-α outperforms MedGemma on tasks of tissue identification, neoplasm detection, and differential diagnosis, with additional comparisons of model variants trained on varying data volumes. All methods, data, and code are released publicly.

Significance. If the performance claims are substantiated with full evaluation details, the work is significant for computational pathology. It directly addresses the scarcity of publicly available WSI-report paired data by releasing an open pipeline (Polysome), a large-scale instruction-tuning dataset (HISTAI-Instruct), and trained models (ANTONI-α variants). This promotes reproducibility and lowers barriers to developing VLMs as pathologist co-pilots, with the public release of code and data representing a clear strength for the field.

major comments (2)
  1. [Abstract and Results] The claim of outperformance over MedGemma on WSI-level VQA tasks lacks any reported metrics (e.g., accuracy, precision, or F1), statistical tests, data splits, or analysis of biases in the synthetic pairs. These details are load-bearing for the central empirical claim and must be provided to allow assessment of whether the gains are meaningful or artifactual.
  2. [Methods] In the section on Polysome and HISTAI-Instruct generation, the synthetic instruction-response pairs may lack sufficient clinical fidelity if generated primarily from slide-level metadata or ungrounded LLM prompting rather than real report text or expert review. This assumption is critical for the generalization claim, as superficial patterns or hallucinations in the 1.1M pairs could inflate benchmark scores without transferring to authentic clinical VQA queries.
minor comments (1)
  1. [Abstract] The mention of 'multiple incarnations of ANTONI-α trained with different amounts of data' would benefit from a brief reference to the specific data volumes used and a pointer to the corresponding performance table or figure for improved readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for strengthening the empirical validation and methodological transparency of our work. We address each major comment below and have revised the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract and Results] The claim of outperformance over MedGemma on WSI-level VQA tasks lacks any reported metrics (e.g., accuracy, precision, or F1), statistical tests, data splits, or analysis of biases in the synthetic pairs. These details are load-bearing for the central empirical claim and must be provided to allow assessment of whether the gains are meaningful or artifactual.

    Authors: We agree that the original submission omitted explicit quantitative metrics, statistical tests, and bias analysis, which are necessary to substantiate the central claims. In the revised manuscript, we have added a dedicated Results subsection with a table reporting accuracy, precision, recall, and F1 scores for ANTONI-α versus MedGemma on tissue identification, neoplasm detection, and differential diagnosis. We include the evaluation data splits (80/10/10 on HISTAI-Instruct), McNemar's tests for statistical significance, and a bias analysis comparing performance on synthetic pairs versus a small held-out set of real clinical queries. These additions directly address the concern and allow readers to evaluate whether the gains are meaningful. revision: yes

  2. Referee: [Methods] In the section on Polysome and HISTAI-Instruct generation, the synthetic instruction-response pairs may lack sufficient clinical fidelity if generated primarily from slide-level metadata or ungrounded LLM prompting rather than real report text or expert review. This assumption is critical for the generalization claim, as superficial patterns or hallucinations in the 1.1M pairs could inflate benchmark scores without transferring to authentic clinical VQA queries.

    Authors: We acknowledge the validity of this concern regarding clinical fidelity and potential hallucinations in the synthetic data. Polysome relies on slide-level metadata and annotations from the public HISTAI dataset with structured prompting for grounding, but we agree this falls short of using real report text or expert review. In the revision, we have expanded the Methods section with explicit prompt templates, examples of generated pairs, and a new quantitative analysis of question-type diversity and potential biases. We have also added a limitations paragraph discussing risks of reduced generalization to authentic clinical VQA and outline plans for future expert validation. This provides greater transparency without overclaiming fidelity. revision: partial
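
For readers unfamiliar with the test invoked in response 1 above: McNemar's test compares two models on the same cases by looking only at the discordant pairs. A minimal sketch using statsmodels, with made-up counts:

```python
# Sketch of the McNemar comparison the rebuttal refers to: a paired test on
# per-case correctness of the two models. The counts below are invented; only
# the shape of the computation is the point.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes over the same evaluation cases:
# rows = ANTONI-α correct/incorrect, cols = MedGemma correct/incorrect.
table = [[412, 95],   # both correct | only ANTONI-α correct
         [23, 70]]    # only MedGemma correct | both incorrect

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.2e}")
```

Only the off-diagonal cells (cases where exactly one model is correct) drive the p-value, which is why a paired test is more informative here than comparing raw accuracies.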

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical external comparison

full rationale

The paper introduces Polysome as a tool to generate synthetic instruction-response pairs from the public HISTAI dataset, creates HISTAI-Instruct, trains ANTONI-α on it, and reports empirical outperformance versus the external baseline MedGemma on WSI-level VQA tasks. No derivations, equations, or fitted parameters are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central performance claims are falsifiable via held-out evaluation against an independent model and do not rely on definitional equivalence or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The work relies on standard machine learning assumptions for training and evaluation. No free parameters are explicitly fitted in the abstract beyond typical training hyperparameters. New entities are the introduced tool, dataset, and model.

axioms (1)
  • standard math Standard i.i.d. assumptions for training/validation/test splits and generalization in supervised learning hold for the generated instruction data.
    Implicit in any empirical ML training and evaluation setup described.
invented entities (3)
  • Polysome no independent evidence
    purpose: Standardized tool for synthetic instruction generation from WSIs
    New tool introduced to address data scarcity for VLM training.
  • HISTAI-Instruct no independent evidence
    purpose: Large-scale whole-slide instruction tuning dataset
    Generated dataset spanning 24,259 slides and 1.1M pairs.
  • ANTONI-α no independent evidence
    purpose: Vision-language model for WSI VQA
    Trained model using the new dataset.

pith-pipeline@v0.9.0 · 5554 in / 1339 out tokens · 29133 ms · 2026-05-16T21:00:49.313541+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

  1. [1]

    Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

    Chengkuan Chen, Luca L. Weishaupt, Drew F. K. Williamson, Richard J. Chen, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Guillaume Jaume, Ming Y. Lu, and Faisal Mahmood. Evidence-based diagnostic reasoning with multi-agent copilot for human pathology, 2025. URL https://arxiv.org/abs/2506.20964

  2. [2]

    Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images

    Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Yang. Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images, 2024a. URL https://arxiv.org/abs/2311.16480

  3. [3]

    Slidechat: A large vision-language assistant for whole-slide pathology image understanding

    Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. arXiv preprint arXiv:2410.11761, 2024b

  4. [4]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023. URL https://arxiv.org/abs/2305.14314

  5. [5]

    Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction

    Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen. Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction, 2024. URL https://arxiv.org/abs/2403.05396

  6. [6]

    Hest-1k: A dataset for spatial transcriptomics and histology image analysis

    Guillaume Jaume, Paul Doucet, Andrew H. Song, Ming Y. Lu, Cristina Almagro-Perez, Sophia J. Wagner, Anurag J. Vaidya, Richard J. Chen, Drew F. K. Williamson, Ahrong Kim, and Faisal Mahmood. Hest-1k: A dataset for spatial transcriptomics and histology image analysis. In Advances in Neural Information Processing Systems, December 2024

  7. [7]

    1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset

    Geert Litjens, Peter Bandi, Babak Ehteshami Bejnordi, Oscar Geessink, Maschenka Balkenhol, Peter Bult, Altuna Halilovic, Meyke Hermsen, Rob van de Loo, Rob Vogels, Quirine F Manson, Nikolas Stathonikos, Alexi Baidoshvili, Paul van Diest, Carla Wauters, Marcory van Dijk, and Jeroen van der Laak. 1399 H&E-stained sentinel lymph node sections of breast ca...

  8. [8]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. URL https://arxiv.org/abs/2304.08485

  9. [9]

    A multimodal generative ai copilot for human pathology

    Ming Lu, Bowen Chen, Drew Williamson, Richard Chen, Melissa Zhao, Aaron Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, Amr Soliman, Chengkuan Chen, Tong Ding, Judy Wang, Georg Gerber, Ivy Liang, Long Le, Anil Parwani, Luca Weishaupt, and Faisal Mahmood. A multimodal generative ai copilot for human pathology. Nature, 634:466–473, 2024...

  10. [10]

    On the importance of text preprocessing for multimodal representation learning and pathology report generation

    Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P. J. Moonemans, Gerben E. Breimer, Willeke A. M. Blokx, and Mitko Veta. On the importance of text preprocessing for multimodal representation learning and pathology report generation, 2025. URL https://arxiv.org/abs/2502.19285

  11. [11]

    Pathology report generation and multimodal representation learning for cutaneous melanocytic lesions

    Ruben T. Lucassen, Sander P. J. Moonemans, Tijn van de Luijtgaarden, Gerben E. Breimer, Willeke A. M. Blokx, and Mitko Veta. Pathology report generation and multimodal representation learning for cutaneous melanocytic lesions. In James C. Gee, Daniel C. Alexander, Jaesung Hong, Juan Eugenio Iglesias, Carole H. Sudre, Archana Venkataraman, Polina Golland, ...

  12. [12]

    Histai: An open-source, large-scale whole slide image dataset for computational pathology

    Dmitry Nechaev, Alexey Pchelnikov, and Ekaterina Ivanova. Histai: An open-source, large-scale whole slide image dataset for computational pathology, 2025. URL https://arxiv.org/abs/2505.12120

  13. [13]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br...

  14. [14]

    Prism: A multi-modal generative foundation model for slide-level histopathology

    George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D Kunz, Juan A Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, et al. Prism: A multi-modal generative foundation model for slide-level histopathology. arXiv preprint arXiv:2405.10254, 2024

  15. [15]

    Manuel Tran et al.

    Manuel Tran, Paul Schmidle, Ruifeng Ray Guo, Sophia J. Wagner, Valentin Koch, Valerio Lupperger, Brenna Novotny, Dennis H. Murphree, Heather D. Hardway, Marina D'Amato, Judith Lefkes, Daan J. Geijs, Annette Feuchtinger, Alexander Böhner, Robert Kaczmarczyk, Tilo Biedermann, Avital L. Amir, Antien L. Mooyaart, Francesco Ciompi, Geert Litjens, Chen Wang, Nn...

  16. [16]

    Mart van Rijthoven et al.

    Mart van Rijthoven, Witali Aswolinskiy, Leslie Tessier, Maschenka Balkenhol, Joep M. A. Bogaerts, Damien Drubay, Laura Comerma Blesa, Dieter Peeters, Elisabeth Specht Stovgaard, Anne-Vibeke Lænkholm, Harry Haynes, Ligia Craciun, Denis Larsimont, Mohamed T. Amgad, Lee AD Cooper, Cyril de Kock, Valerie Dechering, Johannes Lotz, Nick Weiss, Mieke van Bocksta...

  17. [17]

    Eugene Vorontsov et al.

    Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, Ellen Yang, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan H. Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Mill...

  18. [18]

    Prism2: Unlocking multi-modal general pathology AI

    Eugene Vorontsov, George Shaikovski, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H. Bernhard, Ran A. Godrich, Juan A. Retamero, Jinru Shia, Mithat Gonen, Martin R. Weiser, David S. Klimstra, Razik Yousfi, Nicolo Fusi, Thomas J. Fuchs, Kristen Severson, and Siqi Liu. Prism2: Unlocking multi-modal general pathology ai with ...

  19. [19]

    The Cancer Genome Atlas Pan-Cancer analysis project

    John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, Joshua M Stuart, and Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10):1113–1120, October 2013. doi:10.1038/ng.2764

  20. [20]

    A versatile pathology co-pilot via reasoning enhanced multimodal large language model

    Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, and Hao Chen. A versatile pathology co-pilot via reasoning enhanced multimodal large language model, 2025. URL https://arxiv.org/abs/2507.17303

  21. [21]

    CoCa: Contrastive Captioners are Image-Text Foundation Models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models, 2022. URL https://arxiv.org/abs/2205.01917

  22. [22]

    Accelerating data processing and benchmarking of ai models for pathology

    Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, and Faisal Mahmood. Accelerating data processing and benchmarking of ai models for pathology. arXiv preprint arXiv:2502.06750, 2025