Online Dynamic Batching with Formal Guarantees for LLM Training
Pith reviewed 2026-06-26 15:55 UTC · model grok-4.3
The pith
Online Dynamic Batching forms batches after true cost observation and raises LLM training throughput 1.58-3.78x at comparable quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Online Dynamic Batching moves batch construction to the point of accurate cost observability after preprocessing and multimodal expansion, solves the Distributed Group Alignment Problem to enforce DDP step alignment, and proves deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure, delivering 1.58-3.78x higher emitted-sample throughput than fixed-batch training at equivalent quality without length caches or model changes.
What carries the argument
The Distributed Group Alignment Problem formalization that encodes DDP step synchronization for dynamic batches and supplies the deadlock-free termination proof under join-mode identity coverage.
Load-bearing premise
The synchronization requirement for preserving DDP step alignment can be formalized as the Distributed Group Alignment Problem such that deadlock-free bounded termination holds under the stated join-mode identity coverage and opt-in non-join sample-quota closure.
What would settle it
A standard DDP training run in which Online Dynamic Batching either deadlocks or exceeds the proven termination bound while using the default join modes would falsify the formal guarantee.
Figures
read the original abstract
Modern LLM training breaks a core assumption behind offline batch samplers: the true training cost of a sample is only observable after preprocessing, augmentation, templating, tokenization, and multimodal visual-token expansion. Unless one pays for a preprocessing- and augmentation-dependent length cache, batch construction is therefore blind to the quantity that determines padding, memory use, and GPU saturation. We introduce Online Dynamic Batching (ODB), a DataLoader-side drop-in system that moves batch formation to this point of accurate observability while preserving DDP step alignment. We formalize this synchronization requirement as the Distributed Group Alignment Problem and prove deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure. ODB requires no model, optimizer, or attention-kernel changes and is released as online-dynamic-batching with lightweight trainer adapters. Across public 2B/8B Qwen3-VL runs on UltraChat/LLaVA/ShareGPT4o, ODB improves literal emitted-sample throughput vs. fixed-batch Standard by 1.58-2.51x on single-node Full FT/LoRA and 1.71-3.78x on two-node Full FT, with Standard-comparable quality; production MM-Mix reaches 4.43x. Against GMT/BMT offline token-budget oracles, ODB is within 15% on UltraChat/LLaVA and faster on high-CV ShareGPT4o: 2.24-2.39x single-node Full FT/LoRA and 3.06-3.69x two-node Full FT. Together, ODB occupies the online/drop-in regime for high-heterogeneity LLM fine-tuning: large throughput gains at Standard-comparable quality, formal DGAP guarantees, and no length-cache precompute or kernel rewrites.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Online Dynamic Batching (ODB), a drop-in DataLoader for LLM training that forms batches after observing true per-sample lengths following preprocessing, augmentation, and tokenization. It formalizes the DDP synchronization constraint as the Distributed Group Alignment Problem (DGAP) and asserts a proof of deadlock-free bounded termination under join-mode identity coverage and opt-in sample-quota closure. Empirical evaluation on 2B/8B Qwen3-VL models across UltraChat, LLaVA, and ShareGPT4o reports 1.58-3.78x throughput gains versus fixed-batch baselines with comparable quality, and competitive or superior results versus GMT/BMT oracles, with no changes to model, optimizer, or kernels.
Significance. If the DGAP termination argument is sound, ODB supplies a practical online solution for high-heterogeneity fine-tuning that avoids length-cache precomputation while preserving distributed step alignment. The reported speedups (up to 4.43x in production MM-Mix) and quality parity on public datasets would make the technique immediately usable for single- and multi-node Full FT/LoRA workloads. The formal guarantee is the distinguishing contribution relative to prior dynamic batching work.
major comments (2)
- [Abstract / §1] Abstract and §1: the central safety claim rests on a proof of deadlock-free bounded termination for DGAP under the stated join-mode identity coverage and opt-in non-join sample-quota closure, yet the manuscript contains no theorem statement, no proof sketch, and no formal model of DDP step alignment or join semantics. This omission is load-bearing for the 'formal guarantees' assertion.
- [§4] §4 (experimental setup): the throughput numbers (1.58-3.78x vs. Standard, within 15% of GMT/BMT) are reported for 2B/8B Qwen3-VL on UltraChat/LLaVA/ShareGPT4o, but the text does not specify whether DDP world-size, gradient-accumulation steps, or exact join-mode parameters were held constant across all compared systems; without these controls the cross-system comparison is not fully reproducible.
minor comments (2)
- [§3] Notation for join-mode identity coverage and sample-quota closure is introduced without a compact definition or pseudocode; a small table or algorithm box would improve readability.
- [Figures 3-5] Figure captions for the throughput plots should explicitly state the number of runs and whether error bars represent standard deviation or min/max.
Simulated Author's Rebuttal
We thank the referee for the thorough review and for highlighting the importance of the formal guarantees and experimental controls. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract / §1] Abstract and §1: the central safety claim rests on a proof of deadlock-free bounded termination for DGAP under the stated join-mode identity coverage and opt-in non-join sample-quota closure, yet the manuscript contains no theorem statement, no proof sketch, and no formal model of DDP step alignment or join semantics. This omission is load-bearing for the 'formal guarantees' assertion.
Authors: We agree that the manuscript as submitted lacks an explicit theorem statement and proof sketch in the main body. While the DGAP formalization and termination argument are developed in §3, the presentation is informal. We will add a dedicated theorem (with statement of assumptions on join-mode identity coverage and sample-quota closure) together with a concise proof sketch to §3, including the required formal model of DDP step alignment and join semantics. This revision will make the central safety claim self-contained. revision: yes
-
Referee: [§4] §4 (experimental setup): the throughput numbers (1.58-3.78x vs. Standard, within 15% of GMT/BMT) are reported for 2B/8B Qwen3-VL on UltraChat/LLaVA/ShareGPT4o, but the text does not specify whether DDP world-size, gradient-accumulation steps, or exact join-mode parameters were held constant across all compared systems; without these controls the cross-system comparison is not fully reproducible.
Authors: All reported runs used identical DDP world size (8 GPUs/node for single-node, 16 GPUs for two-node), gradient-accumulation steps (=1), and join-mode settings (default identity coverage with opt-in quota closure). These parameters were fixed across Standard, ODB, GMT, and BMT. We will insert an explicit paragraph in §4 listing these controls and confirming they were held constant, thereby restoring full reproducibility. revision: yes
Circularity Check
No significant circularity; empirical results and formal claims remain independent.
full rationale
The paper reports throughput gains measured directly against external baselines (Standard fixed-batch, GMT/BMT oracles) on public datasets (UltraChat, LLaVA, ShareGPT4o). The DGAP formalization and termination claim is asserted as an internal proof rather than derived from or fitted to the reported speedups. No equations, parameters, or results reduce by construction to self-defined quantities, self-citations, or renamed inputs. The derivation chain for both performance and guarantees is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard assumptions of distributed systems regarding message passing, process termination, and absence of Byzantine faults suffice to prove deadlock-free bounded termination for the DGAP.
invented entities (2)
-
Online Dynamic Batching (ODB) system
no independent evidence
-
Distributed Group Alignment Problem (DGAP)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023
Pith/arXiv arXiv 2023
-
[2]
Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, et al. Qwen3-VL technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[3]
Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jian- qiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report....
Pith/arXiv arXiv 2025
-
[4]
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023
Pith/arXiv arXiv 2023
-
[5]
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022
2022
-
[6]
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. UltraChat: Scaling alignment data for large language models with multi-round chat.arXiv preprint arXiv:2305.14233, 2023
Pith/arXiv arXiv 2023
-
[7]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Con- ference on Learning Representations (ICLR), 2021. 10
2021
-
[8]
LoRA: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInter- national Conference on Learning Representations (ICLR), 2022
2022
-
[9]
OpenNMT: Open-source toolkit for neural machine translation
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. OpenNMT: Open-source toolkit for neural machine translation. InProceedings of ACL 2017, System Demonstrations, 2017
2017
-
[10]
Reducing activation recomputation in large transformer models
Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Ander- sch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. InProceedings of Machine Learning and Systems (MLSys), 2023
2023
-
[11]
Mario Michael Krell, Matej Kosec, Sonia P Perez, and Andrew Fitzgibbon. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance. InarXiv preprint arXiv:2107.02027, 2021
arXiv 2021
-
[12]
NeMo: A toolkit for building AI applications using neural modules
Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Belber, Sandeep Subramanian, Vitaly Huang, et al. NeMo: A toolkit for building AI applications using neural modules. InarXiv preprint arXiv:1909.09577, 2019
arXiv 1909
-
[13]
Efficient memory management for large language model serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Sym- posium on Operating Systems Principles (SOSP), 2023
2023
-
[14]
LLaV A-OneVision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yan- wei Li, Ziwei Liu, and Chunyuan Li. LLaV A-OneVision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024
Pith/arXiv arXiv 2024
-
[15]
Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, and Yuxiong He. DeepSpeed data efficiency: Improving deep learning model quality and training efficiency via efficient data sampling and routing.arXiv preprint arXiv:2212.03597, 2022
arXiv 2022
-
[16]
Pytorch distributed: Expe- riences on accelerating data parallel training
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. Pytorch distributed: Expe- riences on accelerating data parallel training. InProceedings of the VLDB Endowment, 2020
2020
-
[17]
Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence par- allelism: Long sequence training from system perspective.arXiv preprint arXiv:2105.13120, 2021
arXiv 2021
-
[18]
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023
Pith/arXiv arXiv 2023
-
[19]
Visual instruction tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[20]
Mosaicml composer: A pytorch library for efficient neural network training
MosaicML. Mosaicml composer: A pytorch library for efficient neural network training. https://github.com/mosaicml/composer, 2022
2022
-
[21]
fairseq: A fast, extensible toolkit for sequence modeling
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grang- ier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019
2019
-
[22]
PyTorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. InAdvances in Neural Information Processing Systems (NeurIPS), 2019
2019
-
[23]
DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. InPro- ceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020. 11
2020
-
[24]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model par- allelism.arXiv preprint arXiv:1909.08053, 2019
Pith/arXiv arXiv 1909
-
[25]
Transformers: State- of-the-art natural language processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, et al. Transformers: State- of-the-art natural language processing. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020
2020
-
[26]
Orca: A distributed serving system for transformer-based generative models
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In16th USENIX Sym- posium on Operating Systems Design and Implementation (OSDI), 2022
2022
-
[27]
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multi- modal understanding and reasoning benchmark for expert AGI.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
2024
-
[28]
LLaMA-Factory: Unified efficient fine-tuning of 100+ language models.Pro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Ma, and Yongqiang Ma. LLaMA-Factory: Unified efficient fine-tuning of 100+ language models.Pro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024
2024
-
[29]
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023. A Cross-Rank Group Alignment Protocol This appendix gives the full algorithm and supporting details summarized in Section 2.3. LetG r be rankr’s current...
Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.