pith. machine review for the scientific record.

arxiv: 2604.16505 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Predicting Blastocyst Formation in IVF: Integrating DINOv2 and Attention-Based LSTM on Time-Lapse Embryo Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords IVF · blastocyst prediction · embryo selection · time-lapse imaging · DINOv2 · LSTM · attention mechanism · hybrid model

The pith

A hybrid DINOv2 and attention LSTM model predicts which embryos will form blastocysts from limited daily images at 96.4 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that features extracted by a self-supervised vision model can be sequenced through an attention-augmented LSTM to forecast blastocyst formation even when only a handful of daily images are available instead of complete videos. This addresses a practical bottleneck in IVF clinics that cannot afford full time-lapse systems and still rely on subjective manual review. If the approach holds, embryo selection becomes more consistent and less dependent on continuous imaging hardware. The model was evaluated on 704 real embryo videos and maintained performance when frames were removed.
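
That robustness claim is easy to probe mechanically: drop daily frames at random and measure how often the model's call flips. Below is a minimal sketch of such a probe on synthetic tensors with a stand-in classifier; the feature dimension, frame count, and drop rates are assumptions, not the authors' protocol.

```python
# Hypothetical robustness probe: drop daily frames at random and check how
# often predictions flip. Stand-in classifier and random features only; this
# is not the authors' evaluation protocol.
import torch
import torch.nn as nn

torch.manual_seed(0)
N, T, D = 32, 6, 384                      # embryos, daily frames, feature dim (assumed)
features = torch.randn(N, T, D)

model = nn.Sequential(nn.Flatten(), nn.Linear(T * D, 2))  # placeholder classifier
model.eval()

def drop_frames(x: torch.Tensor, p: float) -> torch.Tensor:
    """Zero each frame independently with probability p, mimicking clinics
    that capture only a subset of the daily images."""
    keep = (torch.rand(x.shape[:2]) > p).float().unsqueeze(-1)  # (N, T, 1)
    return x * keep

with torch.no_grad():
    full = model(features).argmax(dim=1)
    for p in (0.1, 0.3, 0.5):
        degraded = model(drop_frames(features, p)).argmax(dim=1)
        print(f"drop rate {p:.1f}: agreement with full-frame predictions "
              f"{(degraded == full).float().mean().item():.2f}")
```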

Core claim

The central claim is that DINOv2 extracts useful spatial features from embryo images and an LSTM equipped with multi-head attention then models their temporal progression to predict blastocyst formation, reaching 96.4 percent accuracy on a dataset of 704 videos while remaining robust to missing frames.
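
For concreteness, per-frame feature extraction with a pre-trained DINOv2 backbone could look like the following sketch via torch.hub. The checkpoint variant (dinov2_vits14, 384-dimensional embeddings) and the 224-pixel input size are assumptions; the paper's summary does not say which DINOv2 model or resolution was used.

```python
# Sketch: per-frame feature extraction with a pre-trained DINOv2 backbone.
# Assumptions: the ViT-S/14 checkpoint (384-d embeddings) and 224x224 inputs.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

# Six daily images of one embryo, already resized and ImageNet-normalized.
frames = torch.randn(6, 3, 224, 224)

with torch.no_grad():
    embeddings = backbone(frames)  # (6, 384): one global feature vector per frame

print(embeddings.shape)
```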

What carries the argument

The hybrid pipeline in which DINOv2 supplies per-image feature vectors that are then processed by a multi-head attention LSTM to capture developmental dynamics over time.
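
A minimal sketch of that temporal head, following the Figure 5 description (stacked LSTMs, multi-head attention with a residual connection and layer normalization, then flatten and classify). The hidden size, depth, and number of heads are illustrative guesses, not the paper's tuned hyperparameters.

```python
# Minimal sketch of the temporal head in Figure 5: stacked LSTMs, multi-head
# attention with a residual connection and layer norm, then flatten + classify.
import torch
import torch.nn as nn

class AttentionLSTMHead(nn.Module):
    def __init__(self, feat_dim=384, hidden=256, heads=4, frames=6, classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(hidden * frames, classes))

    def forward(self, x):            # x: (batch, frames, feat_dim) DINOv2 features
        h, _ = self.lstm(x)          # stacked LSTMs extract temporal features
        a, _ = self.attn(h, h, h)    # multi-head attention over the sequence
        h = self.norm(h + a)         # residual connection + normalization
        return self.classifier(h)    # logits for Blasto vs. Non-Blasto

logits = AttentionLSTMHead()(torch.randn(8, 6, 384))  # batch of 8 embryo sequences
print(logits.shape)  # torch.Size([8, 2])
```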

Load-bearing premise

The 704 embryo videos used for training and testing represent the range of imaging conditions and patient demographics encountered in other IVF laboratories.

What would settle it

Accuracy falling below 85 percent when the trained model is applied to embryo images collected at a different clinic with different time-lapse cameras or patient populations.

Figures

Figures reproduced from arXiv: 2604.16505 by Magnus Johnsson, Niclas Wölner-Hanssen, Reza Khoshkangini, Thomas Ebner, Zahra Asghari Varzaneh.

Figure 1. Time-lapse images of the 16 stages of embryonic development [19].
Figure 2. Proportion of blastocyst formation over time. The blue line tracks the proportion of embryos that formed blastocysts out of the total 704; the grey dashed line follows the number of active embryos over time (some embryos are not annotated until +24 h).
Figure 3. Overview of the proposed hybrid model: DINOv2 extracts features from embryo frames, followed by temporal analysis using an LSTM with multi-head attention and hyperparameter tuning; the model then classifies sequences into Blasto or Non-Blasto categories.
Figure 4. Framework of the DINOv2 architecture.
Figure 5. Framework of the LSTM multi-head attention fusion: an architecture combining stacked LSTMs for temporal feature extraction with multi-head attention to capture long-range dependencies, followed by normalization and classification layers.
Figure 7. ROC curve of the LSTM multi-head attention fusion.
Figure 8. Training history for loss and accuracy metrics, with early stopping.
read the original abstract

The selection of the optimal embryo for transfer is a critical yet challenging step in in vitro fertilization (IVF), primarily due to its reliance on the manual inspection of extensive time-lapse imaging data. A key obstacle in this process is predicting blastocyst formation from the limited number of daily images available. Many clinics also lack complete time-lapse systems, so full videos are often unavailable. In this study, we aimed to predict which embryos will develop into blastocysts using limited daily images from time-lapse recordings. We propose a novel hybrid model that combines DINOv2, a transformer-based vision model, with an enhanced long short-term memory (LSTM) network featuring a multi-head attention layer. DINOv2 extracts meaningful features from embryo images, and the LSTM model then uses these features to analyze embryo development over time and generate final predictions. We tested our model on a real dataset of 704 embryo videos. The model achieved 96.4% accuracy, surpassing existing methods. It also performs well with missing frames, making it valuable for many IVF laboratories with limited imaging systems. Our approach can assist embryologists in selecting better embryos more efficiently and with greater confidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a hybrid model that uses DINOv2 to extract features from time-lapse embryo images and feeds them into an attention-augmented LSTM to predict blastocyst formation. It evaluates the approach on a dataset of 704 embryo videos, reports 96.4% accuracy (surpassing prior methods), and claims robustness when frames are missing.

Significance. If the accuracy claim survives proper patient-level cross-validation and external testing, the work would offer a practical aid for embryo selection in IVF clinics that lack complete time-lapse systems. The choice of a pre-trained vision transformer plus temporal attention is a reasonable modern adaptation, and explicit handling of incomplete sequences addresses a genuine clinical constraint.

major comments (2)
  1. [Results] Results section: the headline 96.4% accuracy on 704 videos is presented without any information on train-test split ratios, patient- or embryo-level stratification, k-fold cross-validation, class balance, or statistical testing. In time-series embryo data, failure to isolate images from the same IVF cycle across splits risks leakage and renders the performance claim uninterpretable (a grouped-split sketch follows this report).
  2. [Methods] Methods section: no description is given of how the 704 videos were acquired (number of patients, embryos per patient, imaging protocol, or exact daily sampling), nor of the baseline methods, their hyper-parameters, or the statistical tests used to assert superiority. These omissions make it impossible to assess whether the reported gains are reproducible or clinically meaningful.
minor comments (1)
  1. [Abstract] The abstract would benefit from a single sentence on validation strategy to allow readers to gauge the 96.4% figure immediately.
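
What the requested leakage control looks like in practice: group-aware, stratified cross-validation that keeps all embryos from one patient (or IVF cycle) on the same side of every split. The sketch below uses scikit-learn's StratifiedGroupKFold on synthetic data; the patient identifiers, features, and labels are hypothetical stand-ins.

```python
# Sketch of the leakage control the referee requests: group-aware, stratified
# cross-validation so no patient contributes embryos to both train and test.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(0)
n_embryos = 704
X = rng.normal(size=(n_embryos, 384))      # per-embryo feature summaries (hypothetical)
y = rng.integers(0, 2, size=n_embryos)     # Blasto / Non-Blasto labels
patient_id = rng.integers(0, 200, size=n_embryos)

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=patient_id)):
    overlap = set(patient_id[train_idx]) & set(patient_id[test_idx])
    assert not overlap, "a patient appears on both sides of the split"
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test embryos")
```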

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important omissions in our description of the experimental protocol. We agree that these details are necessary for assessing the validity of our results and will revise the manuscript accordingly to enhance transparency and reproducibility.

read point-by-point responses
  1. Referee: [Results] Results section: the headline 96.4% accuracy on 704 videos is presented without any information on train-test split ratios, patient- or embryo-level stratification, k-fold cross-validation, class balance, or statistical testing. In time-series embryo data, failure to isolate images from the same IVF cycle across splits risks leakage and renders the performance claim uninterpretable.

    Authors: We agree that the original manuscript omitted these critical details on the evaluation protocol, which is a valid concern given the risk of data leakage in time-series embryo imaging. In the revised version, we will add a dedicated subsection detailing the train-test split ratios, patient-level stratification, k-fold cross-validation procedure, class balance, and the statistical tests used to compare against baselines. This will directly address the potential for leakage and make the 96.4% accuracy claim fully interpretable. revision: yes

  2. Referee: [Methods] Methods section: no description is given of how the 704 videos were acquired (number of patients, embryos per patient, imaging protocol, or exact daily sampling), nor of the baseline methods, their hyper-parameters, or the statistical tests used to assert superiority. These omissions make it impossible to assess whether the reported gains are reproducible or clinically meaningful.

    Authors: We acknowledge that the Methods section was insufficiently detailed regarding dataset acquisition and the implementation of baselines. We will expand this section in the revision to describe the acquisition process (including patient and embryo counts, imaging protocol, and daily sampling), provide full descriptions of the baseline methods along with their hyper-parameters, and specify the statistical tests employed. These additions will support reproducibility and allow readers to better evaluate the clinical relevance of the reported improvements. revision: yes

Circularity Check

0 steps flagged

Standard supervised ML pipeline with no circular derivation

full rationale

The paper describes a conventional supervised learning setup: DINOv2 extracts image features from time-lapse embryo frames, these features are fed into an LSTM with multi-head attention for temporal modeling, the network is trained on labeled videos, and accuracy is measured on held-out test data. No load-bearing step reduces by construction to its own inputs, no fitted parameter is relabeled as a prediction, and no self-citation chain is invoked to justify the architecture or results. The reported 96.4% accuracy is an empirical evaluation metric, not a tautological consequence of the model definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied deep-learning study that relies on a pre-trained DINOv2 backbone and standard LSTM training; the abstract lists no explicit free parameters, axioms, or invented entities beyond the model architecture itself.

pith-pipeline@v0.9.0 · 5542 in / 1125 out tokens · 50080 ms · 2026-05-10T15:58:30.275815+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    A. Eugster, A. J. Vingerhoets, Psychological aspects of in vitro fertilization: a review, Social Science & Medicine 48 (5) (1999) 575–589

  2. [2]

    D. A. Blake, M. Proctor, N. Johnson, D. Olive, C. M. Farquhar, Q. Lamberts, Cleavage stage versus blastocyst stage embryo transfer in assisted conception, Cochrane Database of Systematic Reviews (4) (2005)

  3. [3]

    H. M. Lukassen, D. D. Braat, A. M. Wetzels, G. A. Zielhuis, E. M. Adang, E. Scheenjes, J. A. Kremer, Two cycles with single embryo transfer versus one cycle with double embryo transfer: a randomized controlled trial, Human Reproduction 20 (3) (2005) 702–708

  4. [4]

    J. E. Swain, Decisions for the IVF laboratory: comparative analysis of embryo culture incubators, Reproductive BioMedicine Online 28 (5) (2014) 535–547

  5. [5]

    C. Wong, A. Chen, B. Behr, S. Shen, Time-lapse microscopy and image analysis in basic and clinical embryo development research, Reproductive BioMedicine Online 26 (2) (2013) 120–129

  6. [6]

    Q. Liao, Q. Zhang, X. Feng, H. Huang, H. Xu, B. Tian, J. Liu, Q. Yu, N. Guo, Q. Liu, et al., Development of deep learning algorithms for predicting blastocyst formation and quality by time-lapse monitoring, Communications biology 4 (1) (2021) 415

  7. [7]

    R. Machtinger, C. Racowsky, Morphological systems of human embryo assessment and clinical evidence, Reproductive BioMedicine Online 26 (3) (2013) 210–221

  8. [8]

    Y. Motato, M. J. de los Santos, M. J. Escriba, B. A. Ruiz, J. Remohí, M. Meseguer, Morphokinetic analysis and embryonic prediction for blastocyst formation through an integrated time-lapse system, Fertility and Sterility 105 (2) (2016) 376–384

  9. [9]

    Z. A. Varzaneh, A. Orooji, L. Erfannia, M. Shanbehzadeh, A new COVID-19 intubation prediction strategy using an intelligent feature selection and k-NN method, Informatics in Medicine Unlocked 28 (2022) 100825

  10. [10]

    M. Jamali, P. Davidsson, R. Khoshkangini, M. G. Ljungqvist, R.-C. Mihailescu, Context in object detection: a systematic literature review, Artificial Intelligence Review 58 (6) (2025) 1–89

  11. [11]

    Z. A. Varzaneh, S. M. Mousavi, R. Khoshkangini, S. M. Moosavi Khaliji, An ensemble model based on transfer learning for the early detection of Alzheimer's disease, Scientific Reports 15 (1) (2025) 34634

  12. [12]

    D. Shen, G. Wu, H.-I. Suk, Deep learning in medical image analysis, Annual review of biomedical engineering 19 (1) (2017) 221–248

  13. [13]

    M. I. Razzak, S. Naz, A. Zaib, Deep learning for medical image processing: Overview, challenges and the future, Classification in BioApps: Automation of Decision Making (2017) 323–350

  14. [14]

    E. I. Fernandez, A. S. Ferreira, M. H. M. Cecílio, D. S. Chéles, R. C. M. de Souza, M. F. G. Nogueira, J. C. Rocha, Artificial intelligence in the ivf laboratory: overview through the application of different types of algorithms for the classification of reproductive data, Journal of Assisted Reproduction and Genetics 37 (10) (2020) 2359–2376

  15. [15]

    T.-M.-T. Luong, N. Q. K. Le, Artificial intelligence in time-lapse system: advances, applications, and future perspectives in reproductive medicine, Journal of Assisted Reproduction and Genetics 41 (2) (2024) 239–252

  16. [16]

    M. Abbasi, P. Saeedi, J. Au, J. Havelock, Time series classification for modality-converted videos: A case study on predicting human embryo implantation from time-lapse images, in: 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), IEEE, 2023, pp. 1–6

  17. [17]

    A. Sharma, A. Dorobantiu, S. Ali, M. Iliceto, M. H. Stensen, E. Delbarre, M. A. Riegler, H. L. Hammer, Deep learning methods to forecasting human embryo development in time-lapse videos, bioRxiv (2024) 2024-03

  18. [18]

    K. Kalyani, P. S. Deshpande, A deep learning model for predicting blastocyst formation from cleavage-stage human embryos using time-lapse images, Scientific Reports 14 (1) (2024) 28019

  19. [19]

    T. Gomez, M. Feyeux, J. Boulant, N. Normand, L. David, P. Paul-Gilloteaux, T. Fréour, H. Mouchère, A time-lapse embryo dataset for morphokinetic parameter prediction, Data in Brief 42 (2022) 108258

  20. [20]

    Y. A. Mohamed, U. K. Yusof, I. S. Isa, M. M. Zain, An automated blastocyst grading system using convolutional neural network and transfer learning, in: 2023 IEEE 13th International Conference on Control System, Computing and Engineering (ICCSCE), IEEE, 2023, pp. 202–207

  21. [21]

    A. A. Mazroa, M. Maashi, Y. Said, M. Maray, A. A. Alzahrani, A. Alkharashi, A. M. Al-Sharafi, Anomaly detection in embryo development and morphology using medical computer vision-aided Swin Transformer with boosted dipper-throated optimization algorithm, Bioengineering 11 (10) (2024) 1044

  22. [22]

    J. Kim, Z. Shi, D. Jeong, J. Knittel, H. Y. Yang, Y. Song, W. Li, Y. Li, D. Ben-Yosef, D. Needleman, et al., Multimodal learning for embryo viability prediction in clinical IVF, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2024, pp. 542–552

  23. [23]

    X. Xie, P. Yan, F.-Y. Cheng, F. Gao, Q. Mai, G. Li, Early prediction of blastocyst development via time-lapse video analysis, in: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), IEEE, 2022, pp. 1–5

  24. [24]

    K. Garg, A. Dev, P. Bansal, H. Mittal, An efficient deep learning model for embryo classification, in: 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), IEEE, 2024, pp. 358–363

  25. [25]

    Z. A. Varzaneh, N. Wölner-Hanssen, R. Khoshkangini, A lightweight transformer approach for predicting blastocyst formation on limited embryo images, in: 2025 International Conference on Visual Communications and Image Processing (VCIP), IEEE, 2025, pp. 1–5

  26. [26]

    Practice Committee of the American Society for Reproductive Medicine, Practice Committee of the Society for Assisted Reproductive Technology, et al., Blastocyst culture and transfer in clinically assisted reproduction: a committee opinion, Fertility and Sterility 110 (7) (2018) 1246–1252

  27. [27]

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., DINOv2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023)

  28. [28]

    M. Hashemi, Enlarging smaller images before inputting into convolutional neural network: zero-padding vs. interpolation, Journal of Big Data 6 (1) (2019) 1–13

  29. [29]

    Y. Yu, X. Si, C. Hu, J. Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Computation 31 (7) (2019) 1235–1270

  30. [30]

    D. Neil, M. Pfeiffer, S.-C. Liu, Phased LSTM: Accelerating recurrent network training for long or event-based sequences, Advances in Neural Information Processing Systems 29 (2016)

  31. [31]

    S. M. Al-Selwi, M. F. Hassan, S. J. Abdulkadir, A. Muneer, E. H. Sumiea, A. Alqushaibi, M. G. Ragab, RNN-LSTM: From applications to modeling techniques and beyond—systematic review, Journal of King Saud University-Computer and Information Sciences (2024) 102068

  32. [32]

    J.-B. Cordonnier, A. Loukas, M. Jaggi, Multi-head attention: Collaborate instead of concatenate, arXiv preprint arXiv:2006.16362 (2020)

  33. [33]

    Z. C. Lipton, D. C. Kale, C. Elkan, R. Wetzel, Learning to diagnose with LSTM recurrent neural networks, arXiv preprint arXiv:1511.03677 (2015)

  34. [34]

    G. Naidu, T. Zuva, E. M. Sibanda, A review of evaluation metrics in machine learning algorithms, in: Computer Science On-line Conference, Springer, 2023, pp. 15–25

  35. [35]

    Ž. Vujović, et al., Classification model evaluation metrics, International Journal of Advanced Computer Science and Applications 12 (6) (2021) 599–606

  36. [36]

    Modlee, Car (2024). URL https://www.kaggle.com/datasets/modlee/time-series-classification-data/data

  37. [37]

    Ebrahimi, Financial (2017). URL https://www.kaggle.com/datasets/shebrahimi/financial-distress?select=Financial+Distress.csv

  38. [38]

    L. Candanedo, Occupancy (2016). URL https://archive.ics.uci.edu/dataset/357/occupancy+detection

  39. [39]

    O. Roesler, EEG (2016). URL https://archive.ics.uci.edu/dataset/264/eeg+eye+state