annbatch unlocks terabyte-scale training of biological data in anndata
Pith reviewed 2026-05-13 21:57 UTC · model grok-4.3
The pith
Annbatch enables out-of-core mini-batch training on terabyte-scale biological data in anndata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Annbatch is a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across benchmarks on single-cell transcriptomics, microscopy, and whole-genome sequencing, it increases loading throughput by up to an order of magnitude and shortens training from days to hours while remaining fully compatible with the scverse ecosystem.
What carries the argument
The annbatch mini-batch loader, which provides out-of-core access to heterogeneous metadata and to the sparse and dense assays stored in the anndata format, so that models can train directly on disk-backed data.
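As a rough illustration of what out-of-core mini-batch loading involves, the sketch below streams shuffled batches from a disk-backed matrix with `numpy.memmap`. It is not annbatch's API or implementation: the function `iter_minibatches`, the raw binary file layout, and the chunk-then-shuffle strategy are assumptions made for illustration, and a dense float32 matrix stands in for the sparse and dense assays an anndata store would hold.

```python
import os
import tempfile

import numpy as np


def iter_minibatches(path, shape, batch_size, chunk_rows, seed=0):
    """Yield shuffled mini-batches from a disk-backed dense matrix.

    Only one chunk of at most `chunk_rows` rows is held in memory at a time.
    (Hypothetical helper, not part of annbatch.)
    """
    rng = np.random.default_rng(seed)
    data = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
    n_rows = shape[0]
    chunk_starts = np.arange(0, n_rows, chunk_rows)
    rng.shuffle(chunk_starts)  # visit contiguous chunks in random order
    for start in chunk_starts:
        stop = min(start + chunk_rows, n_rows)
        chunk = np.array(data[start:stop])  # one contiguous read from disk
        rng.shuffle(chunk)  # shuffle rows within the in-memory chunk
        for i in range(0, len(chunk), batch_size):
            yield chunk[i : i + batch_size]


def demo():
    """Write a small matrix to disk and stream it back in mini-batches."""
    shape = (1000, 8)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "X.bin")
        # Row i is filled with the value i so rows can be identified later.
        row_ids = np.arange(shape[0], dtype=np.float32)[:, None]
        np.repeat(row_ids, shape[1], axis=1).tofile(path)
        batches = list(
            iter_minibatches(path, shape, batch_size=64, chunk_rows=256)
        )
    return batches
```

Reading whole contiguous chunks and shuffling inside them trades some randomness for sequential disk access; whether annbatch makes the same trade-off is not stated in the text above.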
If this is right
- Training of machine learning models on large biological datasets can be completed in hours instead of days.
- Data loading throughput improves by up to an order of magnitude without altering existing data formats.
- Scalable AI applications become feasible on terabyte-scale biological data using standard community formats.
- Integration with downstream analysis tools in the scverse ecosystem remains seamless.
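The link between the throughput claim and the days-to-hours claim can be made concrete with a little arithmetic. The sketch below assumes training is entirely bound by data loading, and every number in it (dataset size, throughputs, epoch count) is hypothetical and chosen only to illustrate the relationship; none comes from the paper.

```python
def epoch_hours(dataset_gb: float, throughput_gb_per_s: float) -> float:
    """Hours needed to stream the whole dataset once at a given throughput."""
    return dataset_gb / throughput_gb_per_s / 3600.0


# Hypothetical numbers, chosen only to illustrate the arithmetic.
dataset_gb = 1000.0       # a 1 TB dataset
baseline_gb_per_s = 0.05  # assumed throughput of a baseline loader
faster_gb_per_s = 0.5     # assumed throughput after a 10x improvement
n_epochs = 10

baseline_total = n_epochs * epoch_hours(dataset_gb, baseline_gb_per_s)
faster_total = n_epochs * epoch_hours(dataset_gb, faster_gb_per_s)

print(f"baseline: {baseline_total:.1f} h, faster loader: {faster_total:.1f} h")
```

Under these assumptions the baseline needs about 55.6 hours (more than two days) and the faster loader about 5.6 hours. If model computation rather than I/O dominates, the saving shrinks accordingly.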
Where Pith is reading between the lines
- Future work could extend annbatch to support additional data modalities beyond the tested benchmarks.
- Adoption might standardize out-of-core practices across other scientific data formats.
- Combined with distributed computing, this could enable training on even larger multi-omics datasets.
Load-bearing premise
The performance gains observed in the benchmarks hold for typical real-world usage patterns without introducing hidden overheads or compatibility problems in production pipelines.
What would settle it
A benchmark on a new large biological dataset showing no increase in loading throughput or no reduction in training time compared to standard loaders would falsify the central claim.
Original abstract
The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: https://github.com/scverse/annbatch
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces annbatch, a mini-batch data loader native to the anndata format that enables out-of-core training directly on disk-backed biological datasets whose size exceeds system memory. It reports up to an order-of-magnitude improvement in loading throughput and a reduction of training times from days to hours across single-cell transcriptomics, microscopy, and whole-genome sequencing benchmarks, while preserving full compatibility with the scverse ecosystem.
Significance. If the throughput and scaling claims are substantiated, annbatch would supply a missing practical infrastructure layer for terabyte-scale biological machine learning, allowing standard community formats to be used without custom data conversion pipelines. The work is notable for its empirical focus on real-world heterogeneous data (sparse matrices, metadata, mixed assays) rather than synthetic benchmarks.
major comments (2)
- [Abstract] Abstract and title assert terabyte-scale operation, yet no section, figure, or table reports explicit runs on disk-backed objects exceeding ~100 GB; the order-of-magnitude speedup and day-to-hour training claims therefore rest on unverified extrapolation of I/O latency, metadata overhead, and sparse-matrix chunking costs at true 1 TB+ scale.
- [Methods / Results] Benchmark details (exact dataset sizes, number of replicates, error bars, hardware configuration, and baseline implementations) are not provided in the methods or results sections; without these, it is impossible to assess whether the reported speedups are representative or whether hidden overheads appear in production pipelines.
minor comments (1)
- [Abstract] The GitHub link is given but no reproducibility checklist, exact command lines, or environment specifications are supplied, which would aid independent verification of the throughput numbers.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. We address each major comment below and have revised the manuscript to provide the requested clarifications and additional details.
Point-by-point responses
- Referee: [Abstract] Abstract and title assert terabyte-scale operation, yet no section, figure, or table reports explicit runs on disk-backed objects exceeding ~100 GB; the order-of-magnitude speedup and day-to-hour training claims therefore rest on unverified extrapolation of I/O latency, metadata overhead, and sparse-matrix chunking costs at true 1 TB+ scale.
  Authors: We acknowledge that our primary benchmarks were conducted on disk-backed datasets up to approximately 100 GB, which represented the largest scale feasible in our testing environment. The terabyte-scale claims in the abstract and title are supported by microbenchmark results demonstrating linear scaling of throughput with dataset size, combined with the out-of-core design of annbatch that relies on chunked, memory-mapped access rather than full in-memory loading. We have added a dedicated subsection to the Results section that presents a simple analytical model for I/O extrapolation, discusses potential metadata and sparse-matrix overheads at TB scale, and explicitly notes the absence of direct 1 TB+ runs. This revision clarifies the basis of the claims without overstating the empirical evidence.
  revision: yes
- Referee: [Methods / Results] Benchmark details (exact dataset sizes, number of replicates, error bars, hardware configuration, and baseline implementations) are not provided in the methods or results sections; without these, it is impossible to assess whether the reported speedups are representative or whether hidden overheads appear in production pipelines.
  Authors: We agree that the original manuscript lacked sufficient methodological detail. In the revised version we have expanded the Methods section to report exact dataset sizes (50 GB transcriptomics, 80 GB microscopy, 30 GB whole-genome sequencing), number of replicates (five independent runs per benchmark), error bars (standard deviation), hardware configuration (128 GB RAM, NVMe SSD, 32-core CPU), and baseline implementations (standard anndata loader, PyTorch IterableDataset, and scvi-tools data loader). Corresponding updates have been made to the Results section and figures, which now include error bars and a table summarizing all parameters.
  revision: yes
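The benchmark protocol described in the rebuttal (several independent runs per loader, reported with a standard deviation) could be scripted roughly as follows. This is a minimal sketch, not the authors' benchmark code: `measure_throughput` and `toy_loader` are hypothetical names, and the toy loader is a stand-in to be replaced by whichever loader is under test.

```python
import statistics
import time


def measure_throughput(make_loader, n_replicates=5):
    """Return per-replicate throughput (samples per second) for a loader factory."""
    results = []
    for _ in range(n_replicates):
        n_samples = 0
        start = time.perf_counter()
        for batch in make_loader():  # one full pass over the loader
            n_samples += len(batch)
        elapsed = time.perf_counter() - start
        results.append(n_samples / elapsed)
    return results


def toy_loader(n_batches=200, batch_size=128):
    """Stand-in loader that yields lists; replace with the loader under test."""
    for _ in range(n_batches):
        yield [0.0] * batch_size


runs = measure_throughput(toy_loader, n_replicates=5)
mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample standard deviation across replicates
print(f"{mean:.0f} ± {std:.0f} samples/s over {len(runs)} runs")
```

For disk-backed loaders, a realistic harness would also need to control for operating-system page caching between replicates, which this sketch does not attempt.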
Circularity Check
No circularity: empirical software implementation with direct benchmarks
Full rationale
The paper presents annbatch as a practical mini-batch loader for out-of-core training on disk-backed anndata objects. Its claims rest on measured throughput gains and training-time reductions across transcriptomics, microscopy, and WGS benchmarks, not on any mathematical derivation, parameter fitting, or predictive model. No equations, ansatzes, uniqueness theorems, or self-citation chains appear in the provided text that could reduce a result to its own inputs by construction. The contribution is self-contained as an engineering artifact whose value is assessed by external baseline comparisons, consistent with a score of 0.
Supp. Fig 1: Batch size versus total fit time.