Data Mixing for Large Language Models Pretraining: A Survey and Outlook
Pith reviewed 2026-05-15 00:54 UTC · model grok-4.3
The pith
The survey formalizes data mixing for LLM pretraining as a bilevel optimization problem on the probability simplex and classifies existing methods into static and dynamic families.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This survey formalizes data mixture optimization as a bilevel problem on the probability simplex and introduces a fine-grained taxonomy that divides existing methods into static mixing, further split into rule-based and learning-based, and dynamic mixing, grouped into adaptive and externally guided families. For each category it summarizes representative approaches and analyzes their performance-cost trade-offs. It identifies cross-cutting challenges including limited transferability across domains, models, and validation sets, plus unstandardized benchmarks, and proposes future directions such as finer-grained domain partitioning and pipeline-aware designs.
What carries the argument
The bilevel optimization formulation on the probability simplex, which treats domain weights as upper-level variables chosen to optimize the outcome of the lower-level model-training process under fixed compute and data budgets.
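A minimal rendering of that bilevel problem, with notation assumed here rather than quoted from the paper (w are the K domain weights on the simplex, θ the model parameters, ℓ the training loss, and L_val the outer validation objective):

```latex
\[
\min_{w \in \Delta^{K-1}} \; \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(w)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(w) \in \arg\min_{\theta} \sum_{k=1}^{K} w_{k}\,
  \mathbb{E}_{x \sim \mathcal{D}_{k}}\bigl[\ell(x;\theta)\bigr],
\qquad
\Delta^{K-1} = \Bigl\{\, w \in \mathbb{R}^{K} : w_{k} \ge 0,\ \textstyle\sum_{k=1}^{K} w_{k} = 1 \,\Bigr\}.
\]
```

On this reading, static methods fix w before training while dynamic methods update it as θ evolves; in either case the practical question the survey raises is how to make the inner arg-min tractable without training a full model for every candidate w.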
Load-bearing premise
That the taxonomy dividing methods into static versus dynamic categories along with their subfamilies comprehensively organizes all existing data mixing approaches without significant omissions or overlaps.
What would settle it
A new data mixing technique that adjusts domain weights during pretraining in a manner that fits none of the four subcategories (rule-based, learning-based, adaptive, externally guided) or that cannot be reduced to the bilevel simplex formulation.
Original abstract
Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex and clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this formulation tractable in practice. We then introduce a fine-grained taxonomy that organizes existing methods along two dimensions: static versus dynamic mixing. Static mixing is further categorized into rule-based and learning-based methods, while dynamic mixing is grouped into adaptive and externally guided families. For each class, we summarize representative approaches and analyze their strengths and limitations from a performance-cost trade-off perspective. Building on this analysis, we highlight challenges that cut across methods, including limited transferability across data domains, optimization objectives, models, and validation sets, as well as unstandardized evaluation protocols and benchmarks, and the inherent tension between performance gains and cost control in learning-based methods. Finally, we outline several exploratory directions, including finer-grained domain partitioning and inverse data mixing, as well as pipeline-aware designs, aiming to provide conceptual and methodological insights for future research.
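To make the abstract's contrast with sample-level selection concrete, here is a minimal sketch of domain-level sampling under a fixed mixture; the domain names, weights, and helper function are illustrative assumptions, not code or numbers from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative corpora and domain weights (assumed for this sketch only).
domains = {
    "web":  [f"web_doc_{i}" for i in range(1000)],
    "code": [f"code_doc_{i}" for i in range(1000)],
    "math": [f"math_doc_{i}" for i in range(1000)],
}
w = np.array([0.7, 0.2, 0.1])  # domain-level sampling weights on the simplex
assert np.isclose(w.sum(), 1.0) and (w >= 0).all()

def sample_batch(batch_size: int) -> list[str]:
    """Draw a batch under a static mixture: pick a domain by w, then a document uniformly."""
    names = list(domains)
    picks = rng.choice(len(names), size=batch_size, p=w)
    return [rng.choice(domains[names[k]]) for k in picks]

batch = sample_batch(8)  # a dynamic method would instead update w between training steps
```

Sample-level selection would score and filter individual documents; data mixing, as surveyed here, only chooses the domain weights w.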
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides a comprehensive review of data mixing for LLM pretraining. It formalizes data mixture optimization as a bilevel problem on the probability simplex, introduces a taxonomy organizing methods into static (rule-based and learning-based) and dynamic (adaptive and externally guided) categories, summarizes representative approaches with their strengths and limitations from a performance-cost perspective, highlights cross-cutting challenges such as limited transferability and unstandardized evaluations, and outlines future directions including finer-grained domain partitioning and inverse data mixing.
Significance. The survey addresses the fragmented literature on data mixing, which is critical for efficient LLM pretraining under data and compute budgets. The bilevel formalization offers a unifying framework, the taxonomy organizes existing methods, and the analysis of cross-cutting challenges provides direction for future research. It gives due credit to existing work through its performance-cost trade-off analysis.
major comments (1)
- [Taxonomy section] The claim that the static/dynamic taxonomy comprehensively organizes all existing methods is load-bearing for the survey's contribution; the manuscript should include an explicit table or appendix mapping all reviewed papers to the categories (rule-based, learning-based, adaptive, externally guided) to allow verification of coverage and to address the risk of omissions in the fine-grained partitioning.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below.
Point-by-point responses
Referee: [Taxonomy section] The claim that the static/dynamic taxonomy comprehensively organizes all existing methods is load-bearing for the survey's contribution; the manuscript should include an explicit table or appendix mapping all reviewed papers to the categories (rule-based, learning-based, adaptive, externally guided) to allow verification of coverage and to address the risk of omissions in the fine-grained partitioning.
Authors: We agree that an explicit mapping table would strengthen verifiability of the taxonomy's coverage. In the revised version we will add an appendix table that enumerates every paper reviewed in the survey and assigns it to one of the four leaf categories (rule-based, learning-based, adaptive, externally guided). This table will also note the primary data domains and optimization objectives used in each work, allowing readers to check for omissions and to assess the fine-grained partitioning.
Revision: yes
Circularity Check
No significant circularity: standard survey formalization and taxonomy
Full rationale
This is a survey paper whose core contribution is an organizational taxonomy and a standard bilevel formalization of data mixing as optimization over the probability simplex. The bilevel framing is presented as a clarification of data mixing's role in the pretraining pipeline, not as a derivation from the paper's own fitted results or self-citations. The static/dynamic taxonomy partitions existing external methods, with no overlap claims that reduce to the authors' own inputs, and the cross-cutting challenges are summarized directly from performance-cost analyses reported in the reviewed literature. No equations, predictions, or uniqueness theorems collapse by construction to the paper's own definitions or prior self-citations; all summarized approaches are attributed to independent prior work. The structure is self-contained as a review, with no load-bearing internal derivations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the literature on data mixing for LLM pretraining is fragmented and lacks a dedicated, systematic survey.