pith. machine review for the scientific record.

arxiv: 2604.21061 · v1 · submitted 2026-04-22 · 💻 cs.AI


InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language


Pith reviewed 2026-05-09 23:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords embryo development · vision-language models · IVF · fine-tuning · natural language generation · time-lapse imaging · multi-modal AI · developmental stages

The pith

A fine-tuned multi-modal model generates natural language descriptions of embryo morphology and development from time-lapse images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether foundational vision-language models can be adapted to IVF by fine-tuning them on embryo images paired with captions. Using only 1,000 examples from a public time-lapse dataset, the resulting InVitroVision model produces descriptions of morphology, cell cycles, and developmental stages that score higher than both the untuned base model and ChatGPT 5.2. Results improve as more training data is added. A sympathetic reader would care because the work shows a route to automated, consistent embryo assessments in fertility care without requiring huge custom-labeled datasets, and it points toward linking those descriptions to broader medical knowledge sources.

Core claim

The central claim is that fine-tuning PaliGemma-2 on image-caption pairs from a publicly available embryo time-lapse dataset yields InVitroVision, a model that outperforms both its base version and a commercial model (ChatGPT 5.2) at generating natural language descriptions of embryo morphology, embryonic cell cycle, and developmental stage. The approach succeeds with limited data and scales with larger training sets, demonstrating that foundational vision-language models can generalize to specialized IVF tasks.

What carries the argument

Fine-tuning of the PaliGemma-2 vision-language model on paired embryo images and captions, which maps visual features directly to descriptive text about morphology and stages.
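
The training recipe itself is not reproduced in this review. As a rough illustration only, a supervised caption fine-tune of PaliGemma-2 with low-rank adapters might look like the sketch below; the checkpoint name, prompt string, LoRA settings, and learning rate are all assumptions rather than values from the paper.

    import torch
    from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    model_id = "google/paligemma2-3b-pt-224"  # assumed base checkpoint
    processor = PaliGemmaProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )

    # Low-rank adapters keep the trainable-parameter count small, which
    # suits a ~1,000-example training set; LoRA is an assumed choice here.
    lora = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def training_step(image, caption):
        # `suffix` tells the processor to build labels so the loss covers
        # only the caption tokens, not the prompt.
        batch = processor(text="<image>describe the embryo", images=image,
                          suffix=caption, return_tensors="pt").to(model.device)
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()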

Load-bearing premise

The captions supplied with the public embryo time-lapse dataset are accurate and representative enough that the fine-tuned model learns patterns that apply to new images.

What would settle it

Evaluate the fine-tuned model on a fresh set of embryo images whose captions were written by independent embryologists at a different clinic. If its descriptions match those expert captions at rates no higher than the base model's or ChatGPT's, the generalization claim is falsified.
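
The review does not name the paper's "overall metrics", so as a stand-in this sketch scores generated descriptions against the independent expert captions with ROUGE-L from the Hugging Face evaluate library; the metric choice is an assumption, and any reference-based caption metric would slot in the same way.

    import evaluate

    rouge = evaluate.load("rouge")

    def score(generated, expert_captions):
        # Mean ROUGE-L of generated descriptions against expert captions.
        return rouge.compute(predictions=generated,
                             references=expert_captions)["rougeL"]

    # Hypothetical held-out data from a different clinic: the claim fails
    # if the fine-tuned model scores no higher than the base model.
    # print(score(invitrovision_out, expert), score(base_out, expert))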

Figures

Figures reproduced from arXiv: 2604.21061 by Bernhard Schenkenfelder, Florian Kromp, Jasmin Primus, Mathias Brunbauer, Nicklas Neu, Raphael Zefferer, Thomas Ebner.

Figure 1: Developmental stages, embryonic cell cycles and morphokinetic variables.
Figure 2: Original examples of oocyte and embryo images, corresponding annotated …
Original abstract

The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes InVitroVision, obtained by fine-tuning the PaliGemma-2 vision-language model on a publicly available embryo time-lapse dataset using only 1,000 image-caption pairs. The captions describe embryo morphology, cell cycles, and developmental stages. The central claim is that the fine-tuned model outperforms ChatGPT 5.2 and the base models in generating natural language descriptions, with performance scaling with larger training sets, demonstrating the feasibility of adapting foundational multimodal models to IVF tasks with limited data.

Significance. If the reported outperformance is supported by rigorous quantitative evaluation, the work would be significant for showing that vision-language models can be adapted to specialized biomedical imaging domains using small datasets, enabling natural language outputs that could improve interpretability and allow integration with LLMs for retrieving scientific evidence in IVF contexts.

major comments (2)
  1. [Abstract and Results] The claim that InVitroVision 'outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics' (Abstract) is not accompanied by specific numerical values, comparison tables, baselines, error bars, or statistical tests, which are necessary to evaluate the strength and reliability of the central empirical claim (a minimal sketch of such a comparison follows this list).
  2. [Methods] The description of the publicly available dataset (Methods) does not include details on how the captions were generated (expert vs. automated) or any quality control/validation steps. This is load-bearing for the generalization claim, as the model may simply replicate inaccuracies or biases present in the training captions rather than learning accurate morphological descriptions.
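
For concreteness, the analysis requested in major comment 1 could be as simple as a paired non-parametric test plus a bootstrapped error bar over per-example metric scores. The sketch below is illustrative and uses no numbers from the paper.

    import numpy as np
    from scipy.stats import wilcoxon

    def compare(scores_a, scores_b, n_boot=10_000, seed=0):
        # Paired Wilcoxon signed-rank test on per-example metric scores
        # from two models evaluated on the same test images.
        a, b = np.asarray(scores_a), np.asarray(scores_b)
        _, p_value = wilcoxon(a, b)
        # Bootstrap a 95% confidence interval for the mean difference,
        # i.e. the error bar the referee asks for.
        rng = np.random.default_rng(seed)
        diffs = a - b
        boots = [rng.choice(diffs, size=diffs.size, replace=True).mean()
                 for _ in range(n_boot)]
        return p_value, tuple(np.percentile(boots, [2.5, 97.5]))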

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us improve the clarity and rigor of our manuscript. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract and Results] The claim that InVitroVision 'outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics' (Abstract) is not accompanied by specific numerical values, comparison tables, baselines, error bars, or statistical tests, which are necessary to evaluate the strength and reliability of the central empirical claim.

    Authors: We agree that the abstract would be strengthened by including specific numerical support for the outperformance claim. While the Results section of the manuscript already contains the full quantitative evaluation (performance metrics, comparison tables against ChatGPT 5.2 and the base models, baselines, error bars on figures, and statistical tests), we have revised the abstract to incorporate key numerical values (e.g., overall metric improvements) and a brief reference to the evaluation protocol. This change makes the central claim more self-contained while respecting the manuscript's length constraints. revision: yes

  2. Referee: [Methods] The description of the publicly available dataset (Methods) does not include details on how the captions were generated (expert vs. automated) or any quality control/validation steps. This is load-bearing for the generalization claim, as the model may simply replicate inaccuracies or biases present in the training captions rather than learning accurate morphological descriptions.

    Authors: We thank the referee for identifying this important gap in transparency. The captions are part of the publicly released dataset and were generated via expert manual annotation by embryologists, with quality control steps (including consistency checks) documented in the original dataset release. We have expanded the Methods section to explicitly describe the caption generation process, note the expert origin of the annotations, reference the dataset source paper for full validation details, and discuss how this supports the generalization claims. This revision directly addresses the concern about potential replication of biases. revision: yes

Circularity Check

0 steps flagged

No circularity: standard empirical fine-tuning on external public dataset

Full rationale

The paper reports fine-tuning the pre-existing PaliGemma-2 vision-language model on a publicly available external embryo time-lapse dataset using 1,000 image-caption pairs. Results are obtained by direct comparison against ChatGPT 5.2 and base models on standard metrics, with performance scaling noted for larger training sets. No equations, self-defined parameters, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation. The central claim rests on empirical evaluation against external benchmarks rather than any internal reduction or ansatz. This matches the default case of a non-circular ML application paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the dataset captions are high-quality ground truth and that standard fine-tuning will produce generalizable descriptions. Beyond the fine-tuning dataset size, no free parameters or invented entities are explicitly introduced in the abstract.

free parameters (1)
  • Fine-tuning dataset size
    The choice of 1,000 images is presented as sufficient for adaptation, but the exact selection procedure and any hyperparameters are not detailed.
axioms (1)
  • domain assumption The captions accompanying the embryo images accurately describe morphology, cell cycle, and developmental stage.
    The model is trained to predict these captions, so their correctness is required for the learned mapping to be meaningful.
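
Because the mapping is only as good as this axiom, one cheap audit is to have a second embryologist independently re-label a sample of images and measure agreement on the stage labels. A minimal sketch with hypothetical labels, assuming scikit-learn:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical developmental-stage labels from two independent annotators.
    annotator_a = ["2-cell", "4-cell", "blastocyst", "morula"]
    annotator_b = ["2-cell", "4-cell", "blastocyst", "blastocyst"]

    # Kappa near 1.0 supports the axiom; low kappa means the training
    # captions may encode annotator noise the model would replicate.
    print(cohen_kappa_score(annotator_a, annotator_b))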

pith-pipeline@v0.9.0 · 5519 in / 1242 out tokens · 46660 ms · 2026-05-09T23:34:26.993503+00:00 · methodology

