Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
Pith reviewed 2026-05-19 17:02 UTC · model grok-4.3
The pith
OP-Mix simulates candidate data mixtures by interpolating low-rank adapters trained on the current model, enabling efficient mixing across all phases of language model training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OP-Mix operates across the entire language model training lifecycle by cheaply simulating candidate data mixtures through interpolation between low-rank adapters trained directly on the current model. This removes the need for fixed proxy models or domain assumptions and grounds every search step in the model's present learning dynamics. In pretraining, the approach improves average perplexity by 6.3 percent over training without mixing. In continual learning, it matches the performance of both retraining and on-policy distillation while using 66 percent and 95 percent less overall compute, respectively.
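To make the headline percentages concrete, the following sketch shows the arithmetic behind a relative reduction. The perplexity values are illustrative placeholders, not numbers from the paper, and the summary does not say whether the 6.3 percent figure averages per-domain gains or compares averaged perplexities.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline

# Illustrative perplexities only (assumed, not taken from the paper).
baseline_ppl, mixed_ppl = 20.0, 18.74
ppl_gain = relative_reduction(baseline_ppl, mixed_ppl)

# "95 percent less compute" leaves 5 percent of the baseline cost,
# which is the same as a 20x reduction; "66 percent less" leaves 34 percent.
remaining_vs_distillation = 1.0 - 0.95
speedup_vs_distillation = 1.0 / remaining_vs_distillation
remaining_vs_retraining = 1.0 - 0.66
```

Read this way, the two compute figures are separate comparisons against two different baselines rather than a single range.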
What carries the argument
Interpolation of low-rank adapters trained on the current model to simulate the learning effects of different data mixtures.
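A minimal sketch of what adapter interpolation could look like for a single weight matrix. It assumes the simulated update is a convex combination of the dense per-source LoRA updates `B @ A`; the paper's exact scheme (for example, whether factors or products are interpolated) is not given in this summary, and all names and shapes here are hypothetical.

```python
import numpy as np

def lora_delta(A, B, scale=1.0):
    """Dense weight update implied by one LoRA adapter: scale * B @ A."""
    return scale * (B @ A)

def interpolate_adapters(adapters, weights):
    """Weighted sum of per-source LoRA updates for one weight matrix.

    adapters: list of (A, B) pairs, A is (r, d_in), B is (d_out, r).
    weights:  mixture proportions, non-negative and summing to 1.
    """
    weights = np.asarray(weights, dtype=float)
    if weights.ndim != 1 or len(weights) != len(adapters):
        raise ValueError("need one weight per adapter")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must lie on the probability simplex")
    return sum(w * lora_delta(A, B) for w, (A, B) in zip(weights, adapters))

# Toy setup: three data sources, one adapter each (random stand-ins).
rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2
adapters = [(rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))
            for _ in range(3)]
base_weight = rng.normal(size=(d_out, d_in))

# Simulated weights for a 50/30/20 mixture, without any further training.
simulated_weight = base_weight + interpolate_adapters(adapters, [0.5, 0.3, 0.2])
```

Under this assumption, scoring a new candidate mixture costs one weighted sum and one evaluation pass instead of a training run.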
If this is right
- A single algorithm can replace separate mixing strategies for pretraining and continual learning phases.
- Models can adapt their data composition on the fly without training dedicated proxy networks.
- Continual learning can retain and acquire capabilities at the same level as full retraining while consuming far less compute.
- Training can be viewed as one continuous process of learning from data rather than a sequence of distinct phases.
Where Pith is reading between the lines
- The same adapter-interpolation technique could be tested on non-language modalities where data mixing also matters.
- If the method scales reliably, it might reduce the need for large-scale hyperparameter sweeps over data compositions.
- Repeated application across many training runs could produce empirical maps of how mixture choices affect long-term capability retention.
Load-bearing premise
Interpolating between low-rank adapters trained directly on the current model accurately reproduces how different data mixtures would change the full model's learning dynamics.
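One way to write this premise down, using notation introduced here purely for illustration (the symbols are assumptions, not the paper's):

```latex
% \theta_t: current model parameters
% \Delta W_i: low-rank update from the adapter trained on source i
% w: mixture weights with w_i \ge 0 and \sum_i w_i = 1
% \theta_t^{(w)}: parameters reached by actually training \theta_t on mixture w
\mathcal{L}\!\left(\theta_t + \sum_{i=1}^{k} w_i\,\Delta W_i\right)
  \;\approx\;
\mathcal{L}\!\left(\theta_t^{(w)}\right)
\quad \text{for all candidate } w .
```

The simulated parameters on the left are linear in w by construction, while the trained parameters on the right need not be, so the premise is most at risk wherever data sources interact non-additively.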
What would settle it
A direct comparison in which a mixture selected by OP-Mix produces worse final perplexity or downstream performance than a carefully hand-tuned mixture after the same total compute budget would falsify the central claim.
Original abstract
Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision making problem -- one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.
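The abstract frames mixing as a recurring search over candidate mixtures, each scored on a cheaply simulated model. A toy sketch of one such search step follows. It assumes random Dirichlet candidates and a caller-supplied loss function; the paper's actual search procedure is not described in this summary, and `pick_mixture` is a hypothetical name.

```python
import numpy as np

def pick_mixture(base_weight, deltas, eval_loss, n_candidates=256, seed=0):
    """Return the candidate mixture whose simulated model has the lowest loss.

    deltas:    per-source weight updates, each the same shape as base_weight.
    eval_loss: callable mapping a weight matrix to a scalar validation loss.
    """
    rng = np.random.default_rng(seed)
    k = len(deltas)
    # Candidates: random points on the simplex plus each single-source corner.
    candidates = np.vstack([rng.dirichlet(np.ones(k), size=n_candidates), np.eye(k)])
    stacked = np.stack(deltas)  # shape (k, d_out, d_in)
    best_w, best_loss = None, np.inf
    for w in candidates:
        # Simulate the mixture by interpolating the per-source updates.
        simulated = base_weight + np.tensordot(w, stacked, axes=1)
        loss = eval_loss(simulated)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy problem: the "ideal" weights are a known 70/30 blend of two updates.
rng = np.random.default_rng(1)
base = rng.normal(size=(4, 3))
deltas = [rng.normal(size=(4, 3)) for _ in range(2)]
target = base + 0.7 * deltas[0] + 0.3 * deltas[1]
best_w, best_loss = pick_mixture(
    base, deltas, lambda W: float(((W - target) ** 2).sum())
)
```

In an online setting this step would be repeated as training proceeds, with fresh adapters trained on the then-current model each time.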
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OP-Mix, an on-policy data mixing algorithm for language models that simulates candidate mixtures by linearly interpolating low-rank adapters trained directly on the current model. This unified approach is claimed to operate across pretraining, continual midtraining, and instruction tuning without proxy models or fixed domain assumptions, yielding a 6.3% average perplexity improvement over no-mixing baselines in pretraining and matching the performance of full retraining and on-policy distillation in continual learning while using 66% and 95% less compute, respectively.
Significance. If the adapter-interpolation simulation is shown to faithfully reproduce mixture effects on full-model dynamics, the work provides a practical, compute-efficient solution to a recurring problem in LM training and supports a continuous rather than phased view of the training process. The on-policy grounding and elimination of separate proxies are clear strengths that could generalize across training stages.
major comments (3)
- [§3] OP-Mix algorithm description: The core efficiency claim rests on the assumption that linear interpolation of independently trained LoRA adapters accurately approximates the loss and gradient effects of training the full model on the corresponding data mixture. No direct validation experiment comparing interpolated predictions to actual full-model training on the same mixtures is reported, leaving open the possibility of systematic mismatch from non-additive interactions (e.g., in attention patterns or optimizer state).
- [§4.2, Table 2] Pretraining results: The headline 6.3% perplexity gain and compute-reduction figures are presented without reported standard deviations, number of random seeds, or explicit controls for total tokens seen; this makes it difficult to assess whether the gains are robust or attributable to the mixing policy versus other experimental factors.
- [§4.3] Continual-learning experiments: The claim that OP-Mix matches retraining and distillation performance while using 66% and 95% less compute, respectively, depends on the fidelity of the adapter-based simulation; without a quantitative measurement of interpolation error (e.g., KL divergence or loss difference between simulated and actual mixture trajectories), the compute-saving numbers cannot be taken as fully supported.
minor comments (2)
- [Abstract] The phrase '66% and 95% less overall compute' should explicitly name the reference baselines (retraining vs. distillation) to avoid ambiguity.
- [§3.1] Notation: The interpolation coefficients and adapter rank are listed as free parameters; a short sensitivity analysis or default-value justification would improve reproducibility.
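The variability reporting requested in the §4.2 comment could be as simple as the sketch below, which reports mean and sample standard deviation over seeds. The three perplexity values are invented placeholders, not results from the paper.

```python
import statistics

def summarize_runs(perplexities):
    """Mean and sample standard deviation over per-seed results."""
    if len(perplexities) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(perplexities), statistics.stdev(perplexities)

# Placeholder values for three seeds, invented for illustration.
mean_ppl, std_ppl = summarize_runs([18.6, 18.8, 18.7])
report = f"{mean_ppl:.2f} ± {std_ppl:.2f}"
```

With only a handful of seeds the standard deviation is itself noisy, so stating the seed count next to each table entry matters as much as the spread.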
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
- Referee: [§3] OP-Mix algorithm description: The core efficiency claim rests on the assumption that linear interpolation of independently trained LoRA adapters accurately approximates the loss and gradient effects of training the full model on the corresponding data mixture. No direct validation experiment comparing interpolated predictions to actual full-model training on the same mixtures is reported, leaving open the possibility of systematic mismatch from non-additive interactions (e.g., in attention patterns or optimizer state).
Authors: We appreciate the referee highlighting the importance of directly validating the adapter-interpolation approximation. The original manuscript supports the approach through consistent empirical gains across training stages rather than a dedicated head-to-head loss comparison. To address this directly, we will add a new experiment in the revised version that measures the difference between interpolated adapter predictions and actual full-model training losses on selected mixtures, including quantification of any discrepancies arising from non-linear interactions.
Revision: yes
- Referee: [§4.2, Table 2] Pretraining results: The headline 6.3% perplexity gain and compute-reduction figures are presented without reported standard deviations, number of random seeds, or explicit controls for total tokens seen; this makes it difficult to assess whether the gains are robust or attributable to the mixing policy versus other experimental factors.
Authors: We agree that reporting variability and experimental controls is necessary for robust interpretation. The pretraining experiments were conducted with three random seeds, with improvements consistent across runs, and all conditions used identical total token budgets. We will update Table 2 to report standard deviations and explicitly document the token-count controls in the revised manuscript.
Revision: yes
- Referee: [§4.3] Continual-learning experiments: The claim that OP-Mix matches retraining and distillation performance while using 66% and 95% less compute, respectively, depends on the fidelity of the adapter-based simulation; without a quantitative measurement of interpolation error (e.g., KL divergence or loss difference between simulated and actual mixture trajectories), the compute-saving numbers cannot be taken as fully supported.
Authors: This is a fair observation on the evidential basis for the reported compute savings. We will incorporate quantitative fidelity metrics in the revised Section 4.3, such as average loss differences and KL divergence between the simulated adapter-interpolated trajectories and actual full-model mixture training on a subset of the continual-learning runs, to directly substantiate the simulation accuracy underlying the efficiency claims.
Revision: yes
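The fidelity metrics discussed in the exchange above could be computed along the following lines. This is a sketch under stated assumptions: it takes next-token logits from an actually trained model and from the adapter-interpolated model on the same inputs, and it picks the direction KL(actual || simulated), which neither the referee nor the authors specify. Function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(actual_logits, simulated_logits):
    """Mean KL(actual || simulated) over positions; logits are (n, vocab)."""
    log_p = log_softmax(actual_logits)
    log_q = log_softmax(simulated_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def loss_gap(actual_logits, simulated_logits, targets):
    """Simulated minus actual mean cross-entropy on the same target tokens."""
    idx = np.arange(len(targets))
    ce_actual = -log_softmax(actual_logits)[idx, targets].mean()
    ce_simulated = -log_softmax(simulated_logits)[idx, targets].mean()
    return float(ce_simulated - ce_actual)

# Random stand-ins for model outputs: 5 positions, vocabulary of 11.
rng = np.random.default_rng(0)
actual = rng.normal(size=(5, 11))
simulated = actual + 0.1 * rng.normal(size=(5, 11))
targets = rng.integers(0, 11, size=5)
kl_value = mean_kl(actual, simulated)
gap_value = loss_gap(actual, simulated, targets)
```

A KL near zero together with a small loss gap across many mixtures would support the simulation; a gap that grows with distance from the single-source corners would point to the non-additive effects the referee raises.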
Circularity Check
No significant circularity; OP-Mix is an empirical simulation method validated against external baselines.
full rationale
The paper presents OP-Mix as a practical online algorithm. Its core step, simulating mixtures by linearly interpolating LoRA adapters that are each trained on a single data source using the live model, is introduced as an engineering insight to avoid proxy models. Performance claims (a 6.3% perplexity gain, and 66% and 95% less compute while matching retraining and distillation, respectively) rest on direct experimental comparisons to independent baselines rather than any internal derivation that reduces a prediction to its own fitted inputs or self-citations. No equation or step equates the simulated mixture effect to the full-model outcome by construction; the interpolation accuracy is an empirical assumption tested against held-out results. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- adapter rank
- interpolation coefficients
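To illustrate why adapter rank is a meaningful free parameter, the sketch below counts trainable parameters for a standard LoRA adapter on one weight matrix. The layer dimensions and rank are hypothetical and not taken from the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 16
adapter_params = lora_param_count(d_in, d_out, rank)
full_params = d_in * d_out
fraction = adapter_params / full_params
```

Rank therefore trades simulation cost against how much of a data source's effect the adapter can express, which is one reason a sensitivity analysis is a reasonable request.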
axioms (1)
- domain assumption: Low-rank adapters trained on the current model capture the directional effect of data sources on future updates.