pith. machine review for the scientific record.

arxiv: 2604.16380 · v1 · submitted 2026-03-25 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:54 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords data mixing · LLM pretraining · data composition · bilevel optimization · static mixing · dynamic mixing · taxonomy · transferability

The pith

Data mixing for LLM pretraining is formalized as a bilevel optimization problem on the probability simplex and classified into static and dynamic methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how to optimally mix data from different domains when pretraining large language models under limited compute and data budgets. It formalizes the optimization of domain sampling weights as a bilevel problem where the upper level selects weights and the lower level trains the model. This matters because the composition of training data strongly influences both training efficiency and how well the model generalizes to new tasks. The survey organizes prior work into a taxonomy separating static methods, which fix weights in advance, from dynamic methods that adjust weights during training. It then examines practical challenges such as methods failing to transfer across different settings and the lack of standard evaluation practices.
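
To make the static side of that distinction concrete, here is a minimal sketch (not taken from the paper) of the simplest rule-based static family the survey covers: temperature-based sampling, which maps raw domain sizes to fixed sampling weights on the probability simplex before training starts. The domain names and token counts below are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's method): rule-based static mixing
# via temperature-scaled sampling over domain sizes.
def temperature_weights(token_counts, tau=0.7):
    """Map raw per-domain token counts to sampling weights on the simplex.

    tau = 1.0 recovers proportional sampling; tau -> 0 approaches uniform
    sampling, up-weighting small domains relative to their raw share.
    """
    total = sum(token_counts.values())
    scaled = {d: (n / total) ** tau for d, n in token_counts.items()}
    z = sum(scaled.values())
    return {d: s / z for d, s in scaled.items()}

if __name__ == "__main__":
    counts = {"web": 500e9, "code": 60e9, "books": 25e9, "wiki": 5e9}  # hypothetical sizes
    print(temperature_weights(counts, tau=0.7))
```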

Core claim

This survey formalizes data mixture optimization as a bilevel problem on the probability simplex and introduces a fine-grained taxonomy that divides existing methods into static mixing, further split into rule-based and learning-based, and dynamic mixing, grouped into adaptive and externally guided families. For each category it summarizes representative approaches and analyzes their performance-cost trade-offs. It identifies cross-cutting challenges including limited transferability across domains, models, and validation sets, plus unstandardized benchmarks, and proposes future directions such as finer-grained domain partitioning and pipeline-aware designs.

What carries the argument

The bilevel optimization formulation on the probability simplex, which treats the domain weights as upper-level variables and the model training process as the lower-level problem; existing methods differ in how they make this inner problem tractable under fixed compute and data budgets.
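
Written out, the bilevel problem takes roughly the following form (our paraphrase of the survey's formulation; the symbols are ours): the upper level picks domain weights w on the simplex, the lower level trains model parameters θ under the mixture w, and the outer objective is evaluated on validation data.

```latex
% Bilevel data-mixture optimization on the probability simplex (our paraphrase).
\[
\begin{aligned}
\min_{w \in \Delta^{K-1}} \quad & \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(w)\bigr) \\
\text{s.t.} \quad & \theta^{*}(w) \in \arg\min_{\theta} \sum_{k=1}^{K} w_k \,
  \mathbb{E}_{x \sim \mathcal{D}_k}\bigl[\ell(x;\theta)\bigr],
\qquad
\Delta^{K-1} = \Bigl\{ w \in \mathbb{R}^{K} : w_k \ge 0,\ \sum_{k=1}^{K} w_k = 1 \Bigr\}.
\end{aligned}
\]
```

Static methods approximately solve for w once before training starts; dynamic methods update w while θ is being trained.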

Load-bearing premise

That the taxonomy dividing methods into static versus dynamic categories along with their subfamilies comprehensively organizes all existing data mixing approaches without significant omissions or overlaps.

What would settle it

A new data mixing technique that adjusts domain weights during pretraining in a manner that fits none of the four subcategories (rule-based, learning-based, adaptive, externally guided) or that cannot be reduced to the bilevel simplex formulation.

Figures

Figures reproduced from arXiv: 2604.16380 by Deyi Xiong, Supryadi, Yuxuan Miao, Zhuo Chen.

Figure 1
Figure 1: Taxonomy of data mixing. Static mixing is divided into rule-based (uniform, proportional, and temperature-based sampling) and learning-based (proxy optimization-based and prediction-based) families; dynamic mixing is divided into adaptive and externally guided families. Representative methods listed in the figure include UniMax [12], UtiliMax & MEDU [13], DoReMi [14], DoGE [15], DML [16], BiMix [17], AutoScale [18], Data Mixing Scaling Laws [19], RegMix [20], MixMin [21], MDE [22], MFMS-GP [23], ADMIRE [24], ODM [25], and ADO [26].
Figure 2
Figure 2: Proxy optimization-based methods employ a proxy model as the optimization carrier. An iterative optimization algorithm is executed on the proxy model, which uses internal signals from its training process (e.g., losses and gradients) to drive the optimization and obtain an approximately optimal data mixture. The resulting data mixture is then directly applied to training the target main model.
Figure 3
Figure 3: Prediction-based methods assume that model performance is a function f of the chosen data mixture. Once this function f has been learned, one can derive the theoretically optimal data mixture. Such methods are typically instantiated in two ways: explicit prediction and implicit prediction. Explicit prediction methods model f as an explicit function and fit it using observations collected from a set of proxy …
Figure 4
Figure 4: Dynamic methods adjust the data mixture on the fly during training of the target main model. Depending on the signal used to drive these updates, they can be divided into two categories. Adaptive methods use internal signals from the target main model (e.g., losses and gradients) as the driving signal, whereas externally guided methods use an external controller to process training-produced signals and out…
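
As a concrete, purely illustrative reading of the prediction-based recipe in Figure 3, the sketch below fits a cheap surrogate f from mixture weights to proxy-run validation loss and then searches the simplex for the mixture the surrogate predicts to be best. The data, the linear surrogate, and all numbers are assumptions for illustration; methods such as RegMix use stronger regressors and real proxy runs.

```python
# Toy sketch (our assumptions, in the spirit of the prediction-based family
# described in Figure 3; not the paper's or RegMix's actual implementation).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observations: each row is a mixture over K = 3 domains (on the
# simplex), y is the validation loss of a small proxy model trained with it.
X = rng.dirichlet(np.ones(3), size=32)
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(32)  # fake losses

# Fit a simple linear surrogate f(w) ~ loss by least squares (a linear fit
# keeps this sketch dependency-free).
A = np.hstack([X, np.ones((32, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Search candidate mixtures on the simplex and keep the one f predicts is best.
candidates = rng.dirichlet(np.ones(3), size=10_000)
pred = np.hstack([candidates, np.ones((len(candidates), 1))]) @ coef
print("predicted-best mixture:", np.round(candidates[np.argmin(pred)], 3))
```
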
read the original abstract

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex and clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this formulation tractable in practice. We then introduce a fine-grained taxonomy that organizes existing methods along two dimensions: static versus dynamic mixing. Static mixing is further categorized into rule-based and learning-based methods, while dynamic mixing is grouped into adaptive and externally guided families. For each class, we summarize representative approaches and analyze their strengths and limitations from a performance-cost trade-off perspective. Building on this analysis, we highlight challenges that cut across methods, including limited transferability across data domains, optimization objectives, models, and validation sets, as well as unstandardized evaluation protocols and benchmarks, and the inherent tension between performance gains and cost control in learning-based methods. Finally, we outline several exploratory directions, including finer-grained domain partitioning and inverse data mixing, as well as pipeline-aware designs, aiming to provide conceptual and methodological insights for future research.
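
To complement the static examples above, here is a minimal sketch of the adaptive dynamic pattern the abstract describes: keep the domain weights on the probability simplex and nudge them during training using internal signals such as per-domain losses. The multiplicative-weights update and all values below are our own illustrative assumptions, not a specific method from the survey.

```python
# Generic sketch (our assumptions, not a specific surveyed method): adaptive
# dynamic mixing via a multiplicative-weights update driven by per-domain loss.
import numpy as np

def update_weights(w, domain_losses, eta=0.1):
    """One adaptive step: up-weight domains with higher current loss,
    then renormalize back onto the probability simplex."""
    w = np.asarray(w, dtype=float) * np.exp(eta * np.asarray(domain_losses, dtype=float))
    return w / w.sum()

# Toy loop with fake per-domain losses standing in for the model's internal signals.
rng = np.random.default_rng(1)
w = np.full(4, 0.25)                                   # start uniform over 4 domains
for step in range(100):
    fake_losses = rng.uniform(0.5, 2.0, size=4) * np.array([1.0, 1.2, 0.8, 1.5])
    w = update_weights(w, fake_losses, eta=0.05)
print("final mixture:", np.round(w, 3))
```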

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper provides a comprehensive review of data mixing for LLM pretraining. It formalizes data mixture optimization as a bilevel problem on the probability simplex, introduces a taxonomy organizing methods into static (rule-based and learning-based) and dynamic (adaptive and externally guided) categories, summarizes representative approaches with their strengths and limitations from a performance-cost perspective, highlights cross-cutting challenges such as limited transferability and unstandardized evaluations, and outlines future directions including finer-grained domain partitioning and inverse data mixing.

Significance. This survey is significant as it addresses the fragmented literature on data mixing, which is critical for efficient LLM pretraining under data and compute budgets. The bilevel formalization offers a unifying framework, the taxonomy helps organize methods, and the analysis of challenges provides insights for future research. It gives credit to existing work by analyzing trade-offs.

major comments (1)
  1. [Taxonomy section] The claim that the static/dynamic taxonomy comprehensively organizes all existing methods is load-bearing for the survey's contribution; the manuscript should include an explicit table or appendix mapping all reviewed papers to the categories (rule-based, learning-based, adaptive, externally guided) to allow verification of coverage and to address the risk of omissions in the fine-grained partitioning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Taxonomy section] The claim that the static/dynamic taxonomy comprehensively organizes all existing methods is load-bearing for the survey's contribution; the manuscript should include an explicit table or appendix mapping all reviewed papers to the categories (rule-based, learning-based, adaptive, externally guided) to allow verification of coverage and to address the risk of omissions in the fine-grained partitioning.

    Authors: We agree that an explicit mapping table would strengthen verifiability of the taxonomy's coverage. In the revised version we will add an appendix table that enumerates every paper reviewed in the survey and assigns it to one of the four leaf categories (rule-based, learning-based, adaptive, externally guided). This table will also note the primary data domains and optimization objectives used in each work, allowing readers to check for omissions and to assess the fine-grained partitioning. revision: yes

Circularity Check

0 steps flagged

No significant circularity: standard survey formalization and taxonomy

full rationale

This is a survey paper whose core contribution is an organizational taxonomy and a standard bilevel formalization of data mixing as optimization over the probability simplex. The bilevel framing is presented as a clarification of data mixing's role in the pretraining pipeline rather than a derivation from the paper's own fitted results or self-citations. The static/dynamic taxonomy partitions methods published by others, without making uniqueness or overlap claims that reduce to the authors' own inputs, and the cross-cutting challenges are summarized directly from the performance-cost analyses reported in the reviewed literature. No equations, predictions, or uniqueness theorems are shown to collapse by construction to the paper's own definitions or prior self-citations; all summarized approaches are attributed to independent prior work. The structure is self-contained as a review with no load-bearing internal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey paper, the central claim rests on the completeness of the literature review, the validity of the bilevel formalization as a unifying lens, and the assumption that the taxonomy captures the key distinctions in the field.

axioms (1)
  • domain assumption The literature on data mixing for LLM pretraining is fragmented and lacks a dedicated systematic survey.
    Stated directly in the abstract as the motivation for the work.

pith-pipeline@v0.9.0 · 5587 in / 1295 out tokens · 41102 ms · 2026-05-15T00:54:00.377774+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 8 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McC...

  2. [2]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

  3. [3]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    X. Bi, Y. Zang, Y. Wang, Y. Mao, Y. Wang, Y. Guo, H. Liu, et al., “Deepseek llm: Scaling open-source language models with longtermism,” arXiv preprint arXiv:2401.02954 , 2024. [Online]. Available: https:// arxiv.org/abs/2401.02954

  4. [5]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://arxiv.org/abs/2303.08774

  5. [6]

    FuxiTranyu: A multilingual large language model trained with balanced data,

    H. Sun, R. Jin, S. Xu, L. Pan, Supryadi, M. Cui, J. Du, Y. Lei, L. Yang, L. Shi, J. Xiao, S. Zhu, and D. Xiong, “FuxiTranyu: A multilingual large language model trained with balanced data,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , F. Dernoncourt, D. Preoţiuc-Pietro, and A. Shimorina, Eds. ...

  6. [7]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems , I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. A...

  7. [8]

    Datacomp-lm: In search of the next generation of training sets for language models,

    J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Y. Gadre, H. Bansal, E. K. Guha, S. Keh, K. Arora, S. Garg, R. Xin, N. Muennighoff, R. Heckel, J. Mercat, M. Chen, S. Gururangan, M. Wortsman, A. Albalak, Y. Bitton, M. Nezhurina, A. Abbas, C.-Y. Hsieh, D. Ghosh, J. Gardner, M. Kilian, H. Zhang, R. Shao, S. Pratt, G. Daras, K. Marathe, A. Gokaslan, J. Zha...

  8. [9]

    Data, data everywhere: A guide for pretraining dataset construction,

    J. Parmar, S. Prabhumoye, J. Jennings, B. Liu, A. Jhunjhunwala, Z. Wang, M. Patwary, M. Shoeybi, and B. Catanzaro, “Data, data everywhere: A guide for pretraining dataset construction,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association...

  9. [10]

    A survey on data selection for language models,

    A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, C. Raffel, S. Chang, T. Hashimoto, and W. Y. Wang, “A survey on data selection for language models,” Transactions on Machine Learning Research , July 2024, tMLR (certified Survey Featured). [Online]. Available: https://openreview.net/forum?id=XfHWcNTSHp

  10. [11]

    A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity,

    S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts, B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno, and D. Ippolito, “A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human L...

  11. [12]

    GLaM: Efficient scaling of language models with mixture-of-experts,

    N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. Le, Y. Wu, Z. Chen, and C. Cui, “GLaM: Efficient scaling of language models with mixture-of-experts,” in Proceedi...

  12. [14]

    UniMax: Fairer and more effective language sampling for large-scale multilingual pretraining,

    [Online]. Available: https://arxiv.org/abs/2304.09151

  13. [15]

    Optimizing pretraining data mixtures with llm-estimated utility,

    W. Held, B. Paranjape, P. S. Koura, M. Lewis, F. Zhang, and T. Mihaylov, “Optimizing pretraining data mixtures with llm-estimated utility,” arXiv preprint arXiv:2501.11747 , 2025. [Online]. Available: https:// arxiv.org/abs/2501.11747

  14. [16]

    Doremi: Optimizing data mixtures speeds up language model pretraining,

    S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. Liang, Q. V. Le, T. Ma, and A. W. Yu, “Doremi: Optimizing data mixtures speeds up language model pretraining,” in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://papers.nips.cc/paper_files/paper/2023/hash/ dcba6be91359358c2355cd920da3fcbd-Abstract-Conference.html

  15. [17]

    DOGE: Domain reweighting with generalization estimation,

    S. Fan, M. Pagliardini, and M. Jaggi, “DOGE: Domain reweighting with generalization estimation,” in Proceedings of the 41 st International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, Eds., vol. 235. PMLR, 21–27 Jul 2024, pp. 12...

  16. [18]

    Data mixing laws: Optimizing data mixtures by predicting language modeling performance,

    J. Ye, P. Liu, T. Sun, J. Zhan, Y. Zhou, and X. Qiu, “Data mixing laws: Optimizing data mixtures by predicting language modeling performance,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), 2025. [Online]. Available: https://openreview.net/forum?id=jjCB27TMK3

  17. [19]

    Bimix: A bivariate data mixing law for language model pretraining

    C. Ge, Z. Ma, D. Chen, Y. Li, and B. Ding, “Bimix: Bivariate data mixing law for language model pretraining,” arXiv preprint arXiv:2405.14908, 2024. [Online]. Available: https://arxiv.org/abs/2405.14908

  18. [20]

    Autoscale: Scale-aware data mixing for pre-training llms,

    F. Kang, Y. Sun, B. Wen, S. Chen, D. Song, R. Mahmood, and R. Jia, “Autoscale: Scale-aware data mixing for pre-training llms,” in Conference on Language Modeling (COLM 2025), 2025. [Online]. Available: https://openreview.net/forum?id=rujwIvjooA

  19. [21]

    Scaling laws for optimal data mixtures,

    M. Shukor, L. Bethune, D. Busbridge, D. Grangier, E. Fini, A. El-Nouby, and P. Ablin, “Scaling laws for optimal data mixtures,” arXiv preprint arXiv:2507.09404 , 2025. [Online]. Available: https://arxiv.org/ abs/2507.09404

  20. [22]

    Regmix: Data mixture as regression for language model pre-training,

    Q. Liu, X. Zheng, N. Muennighoff, G. Zeng, L. Dou, T. Pang, J. Jiang, and M. Lin, “Regmix: Data mixture as regression for language model pre-training,” in International Conference on Learning Representations (ICLR 2025), 2025. [Online]. Available: https://proceedings.iclr.cc/paper_files/paper/2025/ hash/ 5f67d864aae6115374fed7beddd119e0-Abstract-Conference.html

  21. [23]

    Mixmin: Finding data mixtures via convex minimization,

    A. Thudi, E. Rovers, Y. Ruan, T. Thrush, and C. J. Maddison, “Mixmin: Finding data mixtures via convex minimization,” in International Conference on Machine Learning (ICML 2025), Poster , 2025. [Online]. Available: https://openreview.net/forum?id=wpaxYGgp2n

  22. [24]

    Optimizing pre-training data mixtures with mixtures of data expert models,

    L. Belenki, A. Agarwal, T. Shi, and K. Toutanova, “Optimizing pre-training data mixtures with mixtures of data expert models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguisti...

  23. [25]

    Data mixture optimization: A multi-fidelity multi-scale bayesian framework,

    T. Yen, A. W. T. Siah, H. Chen, T. Peng, C. D. Guetta, and H. Namkoong, “Data mixture optimization: A multi-fidelity multi-scale bayesian framework,” in NeurIPS 2025, Poster, 2025. [Online]. Available: https:// openreview.net/forum?id=Kvsa8ZXd0W

  24. [27]

    Available: https://arxiv.org/abs/2508.11551

    [Online]. Available: https://arxiv.org/abs/2508.11551

  25. [28]

    Efficient online data mixing for language model pre-training,

    A. Albalak, L. Pan, C. Raffel, and W. Y. Wang, “Efficient online data mixing for language model pre- training,” arXiv preprint arXiv:2312.02406, 2023. [Online]. Available: https://arxiv.org/abs/2312.02406

  26. [29]

    Adaptive data optimization: Dynamic sample selection with scaling laws,

    Y. Jiang, A. Zhou, Z. Feng, S. Malladi, and J. Z. Kolter, “Adaptive data optimization: Dynamic sample selection with scaling laws,” in The Thirteenth International Conference on Learning Representations , 2025, iCLR 2025 Poster. [Online]. Available: https://iclr.cc/virtual/2025/poster/29145

  27. [30]

    Velocitune: A velocity-based dynamic domain reweighting method for continual pre-training,

    Z. Luo, X. Zhang, X. Liu, H. Li, Y. Gong, Q. Chen, and P. Cheng, “Velocitune: A velocity-based dynamic domain reweighting method for continual pre-training,” in Proceedings of the 63 rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 16644–...

  28. [31]

    Pike: Adaptive data mixing for multitask learning under low gradient conflicts,

    Z. Li, Y. Deng, P. Zhong, M. Razaviyayn, and V. Mirrokni, “Pike: Adaptive data mixing for multitask learning under low gradient conflicts,” arXiv preprint arXiv:2502.06244, 2025. [Online]. Available: https:// arxiv.org/abs/2502.06244

  29. [32]

    Grape: Optimize data mixture for group robust multi-target adaptive pretraining,

    S. Fan, M. I. Glarou, and M. Jaggi, “Grape: Optimize data mixture for group robust multi-target adaptive pretraining,” in Thirty-ninth Conference on Neural Information Processing Systems , 2025, neurIPS 2025 Poster. [Online]. Available: https://openreview.net/forum?id=JRmIvBcnWc

  30. [33]

    Actor-critic based online data mixing for language model pre-training,

    J. Ma, C. Dang, and M. Liao, “Actor-critic based online data mixing for language model pre-training,” arXiv preprint arXiv:2505.23878, 2025. [Online]. Available: https://arxiv.org/abs/2505.23878

  31. [34]

    Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

    K. Yang, X. Liu, L. Ji, H. Li, Y. Gong, P. Cheng, and M. Yang, “Data mixing agent: Learning to re-weight domains for continual pre-training,” arXiv preprint arXiv:2507.15640 , 2025. [Online]. Available: https:// arxiv.org/abs/2507.15640

  32. [35]

    Tikmix: Take data influence into dynamic mixture for language model pre-training,

    Y. Wang, B. Liu, F. Liu, Y. Guo, J. Deng, X. Wu, W. Zhou, X. Zhou, and T. Wang, “Tikmix: Take data influence into dynamic mixture for language model pre-training,” arXiv preprint arXiv:2508.17677, 2025. [Online]. Available: https://arxiv.org/abs/2508.17677

  33. [36]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P.-S. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Else...

  34. [37]

    OLMo: Accelerating the science of language models,

    D. Groeneveld, I. Beltagy, E. P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. H. Saunders...

  35. [38]

    Bert: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 201...

  36. [39]

    Veco: Variable and flexible cross-lingual pre-training for language understanding and generation,

    F. Luo, W. Li, Y. Liu, B. Huang, Z. Wang, J. Guo, X.-L. Sun, and W.-Y. Liu, “Veco: Variable and flexible cross-lingual pre-training for language understanding and generation,” in Proceedings of the 59 th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, Aug. 2021, pp. 3980–3993. [Online]. ...

  37. [40]

    Unsupervised cross-lingual representation learning at scale,

    A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58 th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, Jul. 2020, pp. 8440–8451....

  38. [41]

    Xlm-e: Cross-lingual language model pre-training via ELECTRA,

    Z. Chi, L. Dong, F. Wei, W. Wang, N. Yang, S. Singhal, S. Wang, X. Song, S. Ma, S. Huang, M. Zhou, and F. Wei, “Xlm-e: Cross-lingual language model pre-training via ELECTRA,” in Proceedings of the 60 th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2...

  39. [42]

    Multilingual denoising pre-training for neural machine translation,

    Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.47/

  40. [43]

    Training sparse mixture of experts text embedding models,

    Z. Nussbaum and B. Duderstadt, “Training sparse mixture of experts text embedding models,” arXiv preprint arXiv:2502.07972, 2025. [Online]. Available: https://arxiv.org/abs/2502.07972

  41. [44]

    mt5: A massively multilingual pre-trained text-to-text transformer,

    L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Online: Association for Computational Linguisti...

  42. [45]

    mmbert: A modern multilingual encoder with annealed language learning,

    M. Marone, O. Weller, W. Fleshman, E. Yang, D. Lawrie, and B. Van Durme, “mmbert: A modern multilingual encoder with annealed language learning,” arXiv preprint arXiv:2509.06888, 2025. [Online]. Available: https://arxiv.org/abs/2509.06888

  43. [46]

    Portfolio selection,

    H. Markowitz, “Portfolio selection,” The Journal of Finance , vol. 7, no. 1, pp. 77–91, March 1952. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.1952.tb01525.x

  44. [47]

    Markowitz portfolio construction at seventy,

    S. Boyd, K. Johansson, R. Kahn, P. Schiele, and T. Schmelzer, “Markowitz portfolio construction at seventy,” The Journal of Portfolio Management, vol. 50, no. 8, pp. 117–160, 2024, special Issue Dedicated to Harry Markowitz. [Online]. Available: https://www.pm-research.com/content/iijpormgmt/50/8/117

  45. [48]

    Robust stochastic approximation approach to stochastic programming,

    A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization , vol. 19, no. 4, pp. 1574–1609, 2009. [Online]. Available: https://epubs.siam.org/doi/10.1137/070704277

  46. [49]

    Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization,

    S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang, “Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization,” in Proceedings of the International Conference on Learning Representations (ICLR) , 2020. [Online]. Available: https:// openreview.net/forum?id=ryxGuJrFvS

  47. [50]

    Distributionally robust language modeling,

    Y. Oren, S. Sagawa, T. B. Hashimoto, and P. Liang, “Distributionally robust language modeling,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9 th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) . Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 422...

  48. [51]

    Prioritized training on points that are learnable, worth learning, and not yet learnt,

    S. Mindermann, A. Bengs, J. Hooper, Y. Gal, and A. Weller, “Prioritized training on points that are learnable, worth learning, and not yet learnt,” in Proceedings of the 39 th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162....

  49. [52]

    An empirical analysis of compute-optimal large language model training,

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. Rae, O. Vinyals, and L. Sifre, “An empirical analysis of compute-optimal large language model training,” in Adv...

  50. [53]

    Lightgbm: A highly efficient gradient boosting decision tree,

    G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017, pp. 3146–3154. [Online]. Available: https://papers.nips.cc/paper_files/paper/2017/hash/ 6449f44a102fde848669bdd9eb6b76fa-Abstract.html

  51. [54]

    C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006. [Online]. Available: https://direct.mit.edu/books/oa-monograph/2320/Gaussian-Processes-for-Machine-Learning

  52. [55]

    Multivariable functional interpolation and adaptive networks,

    D. S. Broomhead and D. Lowe, “Multivariable functional interpolation and adaptive networks,” Complex Systems, vol. 2, no. 3, pp. 321–355, 1988. [Online]. Available: https://www.complex-systems.com/abstracts/v02_i03_a05/

  53. [56]

    Practical multi-fidelity bayesian optimization for hyperparameter tuning,

    J. Wu, S. Toscano-Palmerin, P. I. Frazier, and A. G. Wilson, “Practical multi-fidelity bayesian optimization for hyperparameter tuning,” in Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, ser. Proceedings of Machine Learning Research, R. P. Adams and V. Gogate, Eds., vol. 115. PMLR, 22–25 Jul 2020, pp. 788–798. [Online]. Availab...

  54. [57]

    The nonstochastic multiarmed bandit problem,

    P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002. [Online]. Available: https: //epubs.siam.org/ doi/10.1137/S0097539701398375

  55. [58]

    Neuronlike adaptive elements that can solve difficult learning control problems,

    A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics , vol. SMC-13, no. 5, pp. 834–846, 1983. [Online]. Available: https://incompleteideas.net/papers/barto-sutton-anderson-83.pdf

  56. [59]

    Conservative q-learning for offline reinforcement learning,

    A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,” in Advances in Neural Information Processing Systems , vol. 33, 2020. [Online]. Available: https:// proceedings.neurips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html

  57. [60]

    Aioli: A unified optimization framework for language model data mixing,

    M. F. Chen, M. Y. Hu, N. Lourie, K. Cho, and C. Ré, “Aioli: A unified optimization framework for language model data mixing,” in International Conference on Learning Representations (ICLR) , 2025. [Online]. Available: https://openreview.net/forum?id=sZGZJhaNSe

  58. [61]

    Unsupervised topic models are data mixers for pre-training language models,

    J. Peng, X. Zhuang, J. Qiu, R. Ma, J. Yu, T. Bai, and C. He, “Unsupervised topic models are data mixers for pre-training language models,” arXiv preprint arXiv:2502.16802 , 2025. [Online]. Available: https://arxiv. org/abs/2502.16802

  59. [62]

    Some methods for classification and analysis of multivariate observations,

    J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, L. M. Le Cam and J. Neyman, Eds. Berkeley, CA: University of California Press, 1967, pp. 281–297. [Online]. Available: https://projecteuclid.org/ebooks...

  60. [63]

    k-means++: The advantages of careful seeding,

    D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07) . New Orleans, Louisiana, USA: Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035. [Online]. Available: https:// dl.acm.org/doi/10.5555/1283383.1283494

  61. [64]

    Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training,

    S. Diao, Y. Yang, Y. Fu, X. Dong, D. Su, M. Kliegl, Z. Chen, P. Belcak, Y. Suhara, H. Yin, M. Patwary, Y. Lin, J. Kautz, and P. Molchanov, “Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training,” arXiv preprint arXiv:2504.13161, 2025. [Online]. Available: https://arxiv.org/ abs/2504.13161

  62. [65]

    Domain2vec: Vectorizing datasets to find the optimal data mixture without training,

    M. Zhang, H. Tissue, L. Wang, and X. Qiu, “Domain2vec: Vectorizing datasets to find the optimal data mixture without training,” arXiv preprint arXiv:2506.10952, 2025. [Online]. Available: https://arxiv.org/ abs/2506.10952

  63. [66]

    Balanced data sampling for language model training with clustering,

    Y. Shao, L. Li, Z. Fei, H. Yan, D. Lin, and X. Qiu, “Balanced data sampling for language model training with clustering,” in Findings of the Association for Computational Linguistics: ACL 2024 , L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 14 012–14 023. [Online]. Available: https...

  64. [67]

    Dynamic gradient alignment for online data mixing,

    S. Fan, D. Grangier, and P. Ablin, “Dynamic gradient alignment for online data mixing,” OpenReview preprint, 2025, iCLR 2025 submission. [Online]. Available: https://openreview.net/forum?id=O3SatrdL97

  65. [68]

    Dids: Domain impact-aware data sampling for large language model training,

    W. Shi, J. Zhang, Y. Wu, J. Fang, S. Zhang, Y. Zhao, H. Chen, R. Zhang, Y. Cui, J. Zhu, S. Han, J. Xu, and X. Zhou, “Dids: Domain impact-aware data sampling for large language model training,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. ...

  66. [69]

    Toremi: Topic-aware data reweighting for dynamic pre-training data selection,

    X. Zhu, Z. Gu, S. Zheng, T. Wang, T. Li, H. Feng, and Y. Xiao, “Toremi: Topic-aware data reweighting for dynamic pre-training data selection,” arXiv preprint arXiv:2504.00695, 2025. [Online]. Available: https:// arxiv.org/abs/2504.00695

  67. [70]

    R&b: Domain regrouping and data mixture balancing for efficient foundation model training,

    A. Ge, T.-H. Huang, J. Cooper, A. Trost, Z. Chu, S. S. S. Namburi GNVV, Z. Cai, K. Park, N. Roberts, and F. Sala, “R&b: Domain regrouping and data mixture balancing for efficient foundation model training,” arXiv preprint arXiv:2505.00358, 2025. [Online]. Available: https://arxiv.org/abs/2505.00358

  68. [71]

    Skill-it! a data-driven skills framework for understanding and training language models,

    M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang, F. Sala, and C. Ré, “Skill-it! a data-driven skills framework for understanding and training language models,” in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://papers.nips.cc/paper_files/paper/2023/hash/70b8505ac 79e3e131756f793cd80eb8d-Abstract-Conference.html

  69. [72]

    Ideal: Data equilibrium adaptation for multi-capability language model alignment,

    C. Ming, C. Qu, M. Cai, Q. Pei, Z. Pan, Y. Li, X. Duan, L. Wu, and C. He, “Ideal: Data equilibrium adaptation for multi-capability language model alignment,” arXiv preprint arXiv:2505.12762, 2025. [Online]. Available: https://arxiv.org/abs/2505.12762

  70. [73]

    AutoMixAlign: Adaptive data mixing for multi-task preference optimization in LLMs,

    N. E. Corrado, J. Katz-Samuels, A. M. Devraj, H. Yun, C. Zhang, Y. Xu, Y. Pan, B. Yin, and T. Chilimbi, “AutoMixAlign: Adaptive data mixing for multi-task preference optimization in LLMs,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, ...

  71. [74]

    Chameleon: A flexible data-mixing framework for language model pretraining and finetuning,

    W. Xie, F. Tonin, and V. Cevher, “Chameleon: A flexible data-mixing framework for language model pretraining and finetuning,” arXiv preprint arXiv:2505.24844 , 2025. [Online]. Available: https://arxiv.org/ abs/2505.24844

  72. [75]

    Sharp analysis of low-rank kernel matrix approximations,

    F. Bach, “Sharp analysis of low-rank kernel matrix approximations,” in Proceedings of the 26 th Annual Conference on Learning Theory , ser. Proceedings of Machine Learning Research, S. Shalev-Shwartz and I. Steinwart, Eds., vol. 30. Princeton, NJ, USA: PMLR, 12–14 Jun 2013, pp. 185–209. [Online]. Available: https://proceedings.mlr.press/v30/Bach13.html

  73. [76]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: https://arxiv.org/abs/2001.08361

  74. [77]

    SampleMix: A sample-wise pre-training data mixing strategy by coordinating data quality and diversity,

    X. Xi, D. Kong, J. Yang, J. Yang, Z. Chen, W. Wang, J. Wang, X. Cai, S. Zhang, and W. Ye, “SampleMix: A sample-wise pre-training data mixing strategy by coordinating data quality and diversity,” in Findings of the Association for Computational Linguistics: EMNLP 2025 , C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Associ...

  75. [78]

    DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

    Z. Chen, G. K. R. Lau, C.-S. Foo, and B. K. H. Low, “Duet: Optimizing training data mixtures via feedback from unseen evaluation tasks,” arXiv preprint arXiv:2502.00270 , 2025. [Online]. Available: https://arxiv. org/abs/2502.00270

  76. [79]

    Quadmix: Quality- diversity balanced data selection for efficient llm pretraining,

    F. Liu, W. Zhou, B. Liu, Z. Yu, Y. Zhang, H. Lin, Y. Yu, X. Zhou, T. Wang, and Y. Cao, “Quadmix: Quality- diversity balanced data selection for efficient llm pretraining,” arXiv preprint arXiv:2504.16511 , 2025. [Online]. Available: https://arxiv.org/abs/2504.16511

  77. [80]

    Data mixture inference attack: BPE tokenizers reveal training data compositions,

    J. Hayase, A. Liu, Y. Choi, S. Oh, and N. A. Smith, “Data mixture inference attack: BPE tokenizers reveal training data compositions,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024).

  78. [81]

    Available: https://proceedings.neurips.cc/paper_files/paper/2024/hash/10e6dfea9a673bef4 a7b1cb9234891bc-Abstract-Conference.html

    [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2024/hash/10e6dfea9a673bef4 a7b1cb9234891bc-Abstract-Conference.html

  79. [82]

    Neural machine translation of rare words with subword units,

    R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54 th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), K. Erk and N. A. Smith, Eds. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725. [Online]. Available: htt...

  80. [83]

    Data proportion detection for optimized data management for large language models,

    H. Liang, K. Zhao, Y. Yang, B. Cui, G. Dong, Z. Zhou, and W. Zhang, “Data proportion detection for optimized data management for large language models,” arXiv preprint arXiv:2409.17527, 2024. [Online]. Available: https://arxiv.org/abs/2409.17527