arxiv: 2605.15220 · v1 · pith:BYO6ZUZAnew · submitted 2026-05-13 · 💻 cs.CL · cs.AI · cs.LG

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Pith reviewed 2026-05-19 17:02 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords data mixing · language model pretraining · continual learning · low-rank adapters · online optimization · instruction tuning

The pith

OP-Mix simulates candidate data mixtures by interpolating low-rank adapters trained on the current model, enabling efficient mixing across all phases of language model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats data mixing as a recurring online decision problem that spans pretraining, midtraining, and instruction tuning rather than isolated phases. It proposes OP-Mix, which evaluates mixture quality without separate proxy models by training low-rank adapters on the model itself and then interpolating their parameters to mimic the effect of different data blends. This keeps the search grounded in the model's actual learning dynamics at each step. Experiments show the method identifies near-optimal mixtures at a fraction of the compute cost of prior approaches. In pretraining it improves average perplexity by 6.3 percent over training without mixing. In continual learning it matches full retraining and on-policy distillation while using 66 percent and 95 percent less overall compute, respectively.

Core claim

OP-Mix operates across the entire language model training lifecycle by cheaply simulating candidate data mixtures through interpolation between low-rank adapters trained directly on the current model. This removes the need for fixed proxy models or domain assumptions and grounds every search step in the model's present learning dynamics. The approach improves average perplexity by 6.3 percent over training without mixing in pretraining and matches the performance of both retraining and on-policy distillation in continual learning while using 66 percent and 95 percent less overall compute, respectively.

What carries the argument

Interpolation of low-rank adapters trained on the current model to simulate the learning effects of different data mixtures.
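
The idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the merge rule `W0 + sum_i w_i * (B_i @ A_i)`, the quadratic stand-in losses, the helper names, and the grid search over two-domain mixtures.

```python
def matmul(X, Y):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merged_weights(W0, adapters, mix):
    """Return W0 + sum_i mix[i] * (B_i @ A_i) for low-rank factor pairs (B_i, A_i)."""
    W = [row[:] for row in W0]
    for w, (B, A) in zip(mix, adapters):
        delta = matmul(B, A)
        for r in range(len(W)):
            for c in range(len(W[0])):
                W[r][c] += w * delta[r][c]
    return W

def avg_domain_loss(W, targets):
    """Hypothetical stand-in for evaluation: mean squared distance between W and
    each domain's preferred weights. A real run would measure held-out loss."""
    losses = [
        sum((W[r][c] - T[r][c]) ** 2 for r in range(len(W)) for c in range(len(W[0])))
        for T in targets
    ]
    return sum(losses) / len(losses)

def best_mixture(W0, adapters, targets, steps=10):
    """Grid-search two-domain mixtures (w, 1 - w) using only merged adapters,
    so no candidate mixture requires additional training."""
    candidates = [(k / steps, 1 - k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda mix: avg_domain_loss(merged_weights(W0, adapters, mix), targets),
    )

# Two rank-1 adapters on a 2x2 weight matrix, one per data domain (toy values).
W0 = [[0.0, 0.0], [0.0, 0.0]]
adapter_a = ([[1.0], [0.0]], [[2.0, 0.0]])  # B is 2x1, A is 1x2
adapter_b = ([[0.0], [1.0]], [[0.0, 2.0]])
targets = [[[2.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]

mix = best_mixture(W0, [adapter_a, adapter_b], targets)
print("selected mixture:", mix)
```

The point of the sketch is the cost structure: each adapter is trained once, and every candidate mixture is then scored by a cheap merge and evaluation rather than by a new training run.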

If this is right

  • A single algorithm can replace separate mixing strategies for pretraining and continual learning phases.
  • Models can adapt their data composition on the fly without training dedicated proxy networks.
  • Continual learning can retain and acquire capabilities at the same level as full retraining while consuming far less compute.
  • Training can be viewed as one continuous process of learning from data rather than a sequence of distinct phases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adapter-interpolation technique could be tested on non-language modalities where data mixing also matters.
  • If the method scales reliably, it might reduce the need for large-scale hyperparameter sweeps over data compositions.
  • Repeated application across many training runs could produce empirical maps of how mixture choices affect long-term capability retention.

Load-bearing premise

Interpolating between low-rank adapters trained directly on the current model accurately reproduces how different data mixtures would change the full model's learning dynamics.
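
One way to state this premise symbolically is sketched below. The notation is introduced here for illustration and is not taken from the paper: \theta_0 is the current model, \Delta_i is the low-rank update learned on domain i alone, w is a mixture over k domains, \theta_{\text{full}}(w) denotes the weights reached by fully training on mixture w, and \mathcal{L} is a held-out loss.

```latex
\theta_{\text{sim}}(w) = \theta_0 + \sum_{i=1}^{k} w_i \,\Delta_i,
\qquad w_i \ge 0,\quad \sum_{i=1}^{k} w_i = 1,
\qquad
\mathcal{L}\bigl(\theta_{\text{sim}}(w)\bigr) \approx \mathcal{L}\bigl(\theta_{\text{full}}(w)\bigr)
\ \text{for all candidate } w .
```

Under this reading, the premise matters most where the gap between the two sides is large enough to change which w minimizes the loss.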

What would settle it

A direct comparison in which a mixture selected by OP-Mix produces worse final perplexity or downstream performance than a carefully hand-tuned mixture after the same total compute budget would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15220 by Apurva Gandhi, Kyunghyun Cho, Michael Y. Hu, Pratyusha Sharma, Tal Linzen.

Figure 1: Overview of OP-MIX. OP-MIX aims to cheaply estimate optimal data mixing ratios in a continual setting. (1) Train a lightweight LoRA adapter on new domains to estimate future performance. (2) Interpolate adapters to simulate different data mixtures without retraining and then estimate the optimal mixture ratio. (3) Train the base model with the computed mixture.
Figure 2: OP-MIX (purple) Pareto-dominates the performance-efficiency frontier when tested alongside baselines across pretraining, continual midtraining, and continual instruction tuning. […] OP-MIX trains a single low-rank adapter (LoRA, Hu et al. [2022]) per data domain directly from the current model, keeping the proxy model on-policy with the model being trained, i.e., reflective of its current state. Second, it uses…
Figure 3: OP-MIX enables continual learning: For a 530M parameter model, OP-MIX mitigates forgetting 27% better on average and 71% better on Reddit than Continual SFT with WSD-S learning schedule, a method specifically designed for continual learning [Wen et al., 2025]. […] with a matching small-model proxy [Team et al., 2025, Grattafiori et al., 2024, Yang et al., 2025]. Moreover, separate smaller proxy models have bee…
Figure 4: Pretraining: OP-MIX outperforms empirical risk minimization (grey line), which samples from all data domains with uniform probability, and beats or matches the performance of other data mixing baselines while being up to 14% more efficient (…
Figure 5: Continual midtraining: OP-MIX outperforms other continual learning baselines and is even competitive with full retraining (grey line), despite training on the datasets sequentially. […] finetuning. Although better than Continual SFT, LoRA-Merge is significantly worse than OP-MIX, indicating that there are benefits to using LoRA only as a proxy. Continual Instruction Tuning (…
Figure 6: Continual instruction tuning: OP-MIX works across cross-entropy and KL distillation objectives, improving the performance of both supervised finetuning and self-distillation fine tuning.
Figure 7: OP-MIX (purple) closely tracks the true data mixing loss surface (red), which we obtain by running full finetuning to completion at each mixture, for different model sizes. […] in the continual midtraining setting and find that the average increase in loss of OP-MIX from the optimal proportions is 0.9%, compared to a 2.9% increase for the fixed 10% data replay baseline. 5.2 Theoretical Analysis: Formalizing OP…
Figure 8: OP-MIX versus grid sweep. In the continual midtraining setting, OP-MIX consistently achieves regret of 1.18% or less with respect to the optimal value (estimated by grid sweeping over mixtures). Regret does not grow as more datasets are introduced, unlike with a fixed 10% old data mixture, where regret does grow.
Figure 9: OP-MIX versus retraining. In the continual midtraining setting, OP-MIX nearly matches the performance of retraining, indicating that it successfully mitigates catastrophic forgetting on previously seen datasets.
Figure 10: LoRA merging only is not sufficient. Simply merging in trained LoRA adapters (grey) …

Original abstract

Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision making problem -- one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OP-Mix, an on-policy data mixing algorithm for language models that simulates candidate mixtures by linearly interpolating low-rank adapters trained directly on the current model. This unified approach is claimed to operate across pretraining, continual midtraining, and instruction tuning without proxy models or fixed domain assumptions, yielding a 6.3% average perplexity improvement over no-mixing baselines in pretraining and matching the performance of full retraining and on-policy distillation in continual learning while using 66% and 95% less compute, respectively.

Significance. If the adapter-interpolation simulation is shown to faithfully reproduce mixture effects on full-model dynamics, the work provides a practical, compute-efficient solution to a recurring problem in LM training and supports a continuous rather than phased view of the training process. The on-policy grounding and elimination of separate proxies are clear strengths that could generalize across training stages.

major comments (3)
  1. [§3] §3 (OP-Mix algorithm description): The core efficiency claim rests on the assumption that linear interpolation of independently trained LoRA adapters accurately approximates the loss and gradient effects of training the full model on the corresponding data mixture. No direct validation experiment comparing interpolated predictions to actual full-model training on the same mixtures is reported, leaving open the possibility of systematic mismatch from non-additive interactions (e.g., in attention patterns or optimizer state).
  2. [§4.2] §4.2 and Table 2 (pretraining results): The headline 6.3% perplexity gain and compute-reduction figures are presented without reported standard deviations, number of random seeds, or explicit controls for total tokens seen; this makes it difficult to assess whether the gains are robust or attributable to the mixing policy versus other experimental factors.
  3. [§4.3] §4.3 (continual-learning experiments): The claim that OP-Mix matches retraining and distillation performance while using 66–95% less compute depends on the fidelity of the adapter-based simulation; without a quantitative measurement of interpolation error (e.g., KL divergence or loss difference between simulated and actual mixture trajectories), the compute-saving numbers cannot be taken as fully supported.
minor comments (2)
  1. [Abstract] Abstract: The phrases '66% and 95% less overall compute' should explicitly name the reference baselines (retraining vs. distillation) to avoid ambiguity.
  2. [§3.1] Notation: The interpolation coefficients and adapter rank are listed as free parameters; a short sensitivity analysis or default-value justification would improve reproducibility.
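
The variability reporting requested in major comment 2 amounts to a per-seed summary like the sketch below. The perplexity values are placeholders invented for illustration, not results from the paper, and the helper names are hypothetical.

```python
import statistics

def relative_improvement(baseline_ppl, method_ppl):
    """Fractional perplexity reduction relative to the baseline (positive = better)."""
    return (baseline_ppl - method_ppl) / baseline_ppl

def summarize_over_seeds(baseline_ppls, method_ppls):
    """Return (mean, sample standard deviation) of per-seed relative improvements."""
    gains = [relative_improvement(b, m) for b, m in zip(baseline_ppls, method_ppls)]
    return statistics.mean(gains), statistics.stdev(gains)

# Placeholder perplexities for three seeds; NOT results from the paper.
baseline = [20.0, 20.0, 20.0]
mixed = [18.0, 19.0, 18.5]

mean_gain, std_gain = summarize_over_seeds(baseline, mixed)
print(f"{100 * mean_gain:.1f}% +/- {100 * std_gain:.1f}% over {len(baseline)} seeds")
```

Pairing runs by seed, as done here, only makes sense if the baseline and the mixing run share a seed and a token budget; otherwise the two lists should be summarized separately.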

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (OP-Mix algorithm description): The core efficiency claim rests on the assumption that linear interpolation of independently trained LoRA adapters accurately approximates the loss and gradient effects of training the full model on the corresponding data mixture. No direct validation experiment comparing interpolated predictions to actual full-model training on the same mixtures is reported, leaving open the possibility of systematic mismatch from non-additive interactions (e.g., in attention patterns or optimizer state).

    Authors: We appreciate the referee highlighting the importance of directly validating the adapter-interpolation approximation. The original manuscript supports the approach through consistent empirical gains across training stages rather than a dedicated head-to-head loss comparison. To address this directly, we will add a new experiment in the revised version that measures the difference between interpolated adapter predictions and actual full-model training losses on selected mixtures, including quantification of any discrepancies arising from non-linear interactions.
    Revision: yes

  2. Referee: [§4.2] §4.2 and Table 2 (pretraining results): The headline 6.3% perplexity gain and compute-reduction figures are presented without reported standard deviations, number of random seeds, or explicit controls for total tokens seen; this makes it difficult to assess whether the gains are robust or attributable to the mixing policy versus other experimental factors.

    Authors: We agree that reporting variability and experimental controls is necessary for robust interpretation. The pretraining experiments were conducted with three random seeds, with improvements consistent across runs, and all conditions used identical total token budgets. We will update Table 2 to report standard deviations and explicitly document the token-count controls in the revised manuscript.
    Revision: yes

  3. Referee: [§4.3] §4.3 (continual-learning experiments): The claim that OP-Mix matches retraining and distillation performance while using 66–95% less compute depends on the fidelity of the adapter-based simulation; without a quantitative measurement of interpolation error (e.g., KL divergence or loss difference between simulated and actual mixture trajectories), the compute-saving numbers cannot be taken as fully supported.

    Authors: This is a fair observation on the evidential basis for the reported compute savings. We will incorporate quantitative fidelity metrics in the revised Section 4.3, such as average loss differences and KL divergence between the simulated adapter-interpolated trajectories and actual full-model mixture training on a subset of the continual-learning runs, to directly substantiate the simulation accuracy underlying the efficiency claims.
    Revision: yes
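
The two fidelity metrics named in the exchange above can be sketched as follows. This is a minimal illustration under stated assumptions: the distributions and losses are placeholders, not measurements from the paper, and `kl_divergence` assumes the second distribution is nonzero wherever the first is.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def relative_loss_gap(simulated_loss, actual_loss):
    """Absolute loss difference as a fraction of the actual (fully trained) loss."""
    return abs(simulated_loss - actual_loss) / actual_loss

# Placeholder next-token distributions and losses; NOT measurements from the paper.
actual_probs = [0.5, 0.5]        # model actually trained on the mixture
simulated_probs = [0.25, 0.75]   # adapter-interpolated stand-in

kl = kl_divergence(actual_probs, simulated_probs)
gap = relative_loss_gap(simulated_loss=2.02, actual_loss=2.00)
print(f"KL = {kl:.4f} nats, relative loss gap = {100 * gap:.1f}%")
```

In practice both quantities would be averaged over held-out tokens and reported per mixture, so that a reader can see whether the simulation error is small relative to the loss differences between candidate mixtures.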

Circularity Check

0 steps flagged

No significant circularity; OP-Mix is an empirical simulation method validated against external baselines.

full rationale

The paper presents OP-Mix as a practical online algorithm whose core step (simulating mixtures by linearly interpolating LoRA adapters, each trained on a single data source using the live model) is introduced as an engineering insight to avoid proxy models. Performance claims (a 6.3% perplexity gain in pretraining, and 66% and 95% less compute while matching retraining and on-policy distillation, respectively) rest on direct experimental comparisons to independent baselines rather than any internal derivation that reduces a prediction to its own fitted inputs or self-citations. No equation or step equates the simulated mixture effect to the full-model outcome by construction; the interpolation accuracy is an empirical assumption tested against held-out results. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that low-rank adapter interpolation is a faithful proxy for full-model data mixing effects; no explicit free parameters or new entities are named in the abstract, but the method implicitly introduces adapter rank and interpolation coefficients as tunable elements.

free parameters (2)
  • adapter rank
    Rank of the low-rank adapters used for simulation; chosen to balance cost and fidelity.
  • interpolation coefficients
    Weights used to blend adapters when simulating a candidate mixture.
axioms (1)
  • domain assumption: Low-rank adapters trained on the current model capture the directional effect of data sources on future updates.
    Invoked to justify why adapter interpolation can replace full retraining for mixture evaluation.
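
To make the cost side of the adapter-rank parameter concrete, the sketch below counts the trainable parameters of a single LoRA pair, where A has shape rank × d_in and B has shape d_out × rank. The layer shape and ranks are hypothetical values chosen for illustration; the source does not report which rank OP-Mix uses.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def lora_fraction(d_in, d_out, rank):
    """LoRA parameters as a fraction of the dense d_out x d_in weight they adapt."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)

# Hypothetical layer shape and ranks, chosen only for illustration.
for rank in (4, 8, 16):
    count = lora_param_count(1024, 1024, rank)
    share = lora_fraction(1024, 1024, rank)
    print(f"rank={rank}: {count} parameters ({100 * share:.2f}% of the dense layer)")
```

The count grows linearly in the rank, which is why the rank trades simulation cost against how much of a full update the adapter can represent.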

pith-pipeline@v0.9.0 · 5826 in / 1344 out tokens · 35555 ms · 2026-05-19T17:02:49.231906+00:00 · methodology

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 3 internal anchors

  1. Olmix: A Framework for Data Mixing Throughout LM Development. arXiv preprint arXiv:2602.12237.

  2. Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance. The Thirteenth International Conference on Learning Representations.

  3. Frankle, Jonathan and Dziugaite, Gintare Karolina and Roy, Daniel M. and Carbin, Michael. Proceedings of the 37th International Conference on Machine Learning. 2020.

  4. Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang... doi:10.5281/zenodo.12608602

  5. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. Proceedings of the 39th International Conference on Machine Learning. 2022.

  6. Progressive Prompts: Continual Learning for Language Models. The Eleventh International Conference on Learning Representations.

  7. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).

  8. Dong, Qingxiu and Li, Lei and Dai, Damai and Zheng, Ce and Ma, Jingyuan and Li, Rui and Xia, Heming and Xu, Jingjing and Wu, Zhiyong and Chang, Baobao and Sun, Xu and Li, Lei and Sui, Zhifang. A Survey on In-context Learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.64

  9. Language models are unsupervised multitask learners. OpenAI blog.

  10. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019.

  11. Finetuned Language Models are Zero-Shot Learners. International Conference on Learning Representations.

  12. Midtraining Bridges Pretraining and Posttraining Distributions. 2026.

  13. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. ArXiv.

  14. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. NAACL.

  15. HellaSwag: Can a Machine Really Finish Your Sentence? Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

  16. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. EMNLP.

  17. Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi. Thirty-Fourth AAAI Conference on Artificial Intelligence.

  18. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv preprint arXiv:1907.10641.

  19. Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/n19-1421

  20. Gemma 3 Technical Report. 2025.

  21. 2 OLMo 2 Furious. 2025.

  22. Qwen3 Technical Report. 2025.

  23. Chen, Mayee F. and Roberts, Nicholas and Bhatia, Kush and Wang, Jue and Zhang, Ce and Sala, Frederic and R\'... Skill-it! a data-driven skills framework for understanding and training language models. Proceedings of the 37th International Conference on Neural Information Processing Systems.

  24. The Llama 3 Herd of Models. 2024.

  25. Cartridges: Lightweight and general-purpose long context representations via self-study. The Fourteenth International Conference on Learning Representations.

  26. RegMix: Data Mixture as Regression for Language Model Pre-training. The Thirteenth International Conference on Learning Representations.

  27. Simin Fan and Matteo Pagliardini and Martin Jaggi. 2024.

  28. Simin Fan and Maria Ios Glarou and Martin Jaggi. 2025.

  29. Aioli: A Unified Optimization Framework for Language Model Data Mixing. The Thirteenth International Conference on Learning Representations.

  30. Qwen2 Technical Report. 2024.

  31. Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. Journal of Machine Learning Research.

  32. Groeneveld, Dirk and Beltagy, Iz and Walsh, Evan and Bhagia, Akshita and Kinney, Rodney and Tafjord, Oyvind and Jha, Ananya and Ivison, Hamish and Magnusson, Ian and Wang, Yizhong and Arora, Shane and Atkinson, David and Authur, Russell and Chandu, Khyathi and Cohan, Arman and Dumas, Jennifer and Elazar, Yanai and Gu, Yuling and Hessel, Jack and Khot, Tus... OLMo: Accelerating the science of language models.

  33. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. Thirty-seventh Conference on Neural Information Processing Systems.

  34. Lopez-Paz, David and Ranzato, Marc Aurelio. Gradient Episodic Memory for Continual Learning.

  35. Efficient Lifelong Learning with A-GEM. ICLR.

  36. Shin, Hanul and Lee, Jung Kwon and Kim, Jaehong and Kim, Jiwon. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017.

  37. Rolnick, David and Ahuja, Arun and Schwarz, Jonathan and Lillicrap, Timothy and Wayne, Gregory. Experience Replay for Continual Learning.

  38. Aljundi, Rahaf and Babiloni, Francesca and Elhoseiny, Mohamed and Rohrbach, Marcus and Tuytelaars, Tinne. Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III. 2018. doi:10.1007/978-3-030-01219-9_9

  39. James Kirkpatrick and Razvan Pascanu and Neil Rabinowitz and Joel Veness and Guillaume Desjardins and Andrei A. Rusu and Kieran Milan and John Quan and Tiago Ramalho and Agnieszka Grabska-Barwinska and Demis Hassabis and Claudia Clopath and Dharshan Kumaran and Raia Hadsell. Proceedings of the National Academy of Sciences. 2017. doi:10.1073/pnas.1611835114

  40. Michael McCloskey and Neal J. Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. 1989. doi:10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/pii/S0079742108605368

  41. Shi, Haizhou and Xu, Zihao and Wang, Hengyi and Qin, Weiyi and Wang, Wenyuan and Wang, Yibin and Wang, Zifeng and Ebrahimi, Sayna and Wang, Hao. ACM Comput. Surv. 2025. doi:10.1145/3735633

  42. Spurious Forgetting in Continual Learning of Language Models. The Thirteenth International Conference on Learning Representations.

  43. Progressive Neural Networks. 2022.

  44. Wang, Xiao and Chen, Tianze and Ge, Qiming and Xia, Han and Bao, Rong and Zheng, Rui and Zhang, Qi and Gui, Tao and Huang, Xuanjing. Orthogonal Subspace Learning for Language Model Continual Learning. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.715

  45. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv preprint arXiv:2601.18734.

  46. Hyperparameter Loss Surfaces Are Simple Near their Optima. Second Conference on Language Modeling.

  47. Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View. The Thirteenth International Conference on Learning Representations.

  48. Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Łukasz and Polosukhin, Illia. Attention is All you Need.

  49. Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection. Forty-second International Conference on Machine Learning.

  50. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022.

  51. Steven Diamond and Stephen Boyd. Journal of Machine Learning Research.

  52. LoRA Without Regret. 2025.

  53. Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws. The Thirteenth International Conference on Learning Representations.

  54. RedPajama: an Open Dataset for Training Large Language Models. 2024.

  55. DataDecide: How to Predict Best Pretraining Data with Small Experiments. Forty-second International Conference on Machine Learning.

  56. Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.

  57. Self-Distillation Enables Continual Learning. 2026.

  58. On-policy distillation. Thinking Machines Lab: Connectionism.

  59. MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging. 2026.

  60. Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo… 2022.

  61. Merge to Mix: Mixing Datasets via Model Merging. 2025.