pith. machine review for the scientific record.

arxiv: 2604.16380 · v1 · submitted 2026-03-25 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:54 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords data mixing · LLM pretraining · data composition · bilevel optimization · static mixing · dynamic mixing · taxonomy · transferability

The pith

Data mixing for LLM pretraining is formalized as a bilevel optimization problem on the probability simplex and classified into static and dynamic methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how to optimally mix data from different domains when pretraining large language models under limited compute and data budgets. It formalizes the optimization of domain sampling weights as a bilevel problem where the upper level selects weights and the lower level trains the model. This matters because the composition of training data strongly influences both training efficiency and how well the model generalizes to new tasks. The survey organizes prior work into a taxonomy separating static methods, which fix weights in advance, from dynamic methods that adjust weights during training. It then examines practical challenges such as methods failing to transfer across different settings and the lack of standard evaluation practices.
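
To make the static side of that distinction concrete, here is a minimal sketch (not taken from the paper) of the simplest rule-based static family the survey covers: temperature-based sampling, which maps raw domain sizes to fixed sampling weights on the probability simplex before training starts. The domain names and token counts below are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's method): rule-based static mixing
# via temperature-scaled sampling over domain sizes.
def temperature_weights(token_counts, tau=0.7):
    """Map raw per-domain token counts to sampling weights on the simplex.

    tau = 1.0 recovers proportional sampling; tau -> 0 approaches uniform
    sampling, up-weighting small domains relative to their raw share.
    """
    total = sum(token_counts.values())
    scaled = {d: (n / total) ** tau for d, n in token_counts.items()}
    z = sum(scaled.values())
    return {d: s / z for d, s in scaled.items()}

if __name__ == "__main__":
    counts = {"web": 500e9, "code": 60e9, "books": 25e9, "wiki": 5e9}  # hypothetical sizes
    print(temperature_weights(counts, tau=0.7))
```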

Core claim

This survey formalizes data mixture optimization as a bilevel problem on the probability simplex and introduces a fine-grained taxonomy that divides existing methods into static mixing, further split into rule-based and learning-based, and dynamic mixing, grouped into adaptive and externally guided families. For each category it summarizes representative approaches and analyzes their performance-cost trade-offs. It identifies cross-cutting challenges including limited transferability across domains, models, and validation sets, plus unstandardized benchmarks, and proposes future directions such as finer-grained domain partitioning and pipeline-aware designs.

What carries the argument

The bilevel optimization formulation on the probability simplex, which treats the domain weights as upper-level variables and the model training process as the lower-level problem; existing methods differ in how they make this inner problem tractable under fixed compute and data budgets.
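
Written out, the bilevel problem takes roughly the following form (our paraphrase of the survey's formulation; the symbols are ours): the upper level picks domain weights w on the simplex, the lower level trains model parameters θ under the mixture w, and the outer objective is evaluated on validation data.

```latex
% Bilevel data-mixture optimization on the probability simplex (our paraphrase).
\[
\begin{aligned}
\min_{w \in \Delta^{K-1}} \quad & \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(w)\bigr) \\
\text{s.t.} \quad & \theta^{*}(w) \in \arg\min_{\theta} \sum_{k=1}^{K} w_k \,
  \mathbb{E}_{x \sim \mathcal{D}_k}\bigl[\ell(x;\theta)\bigr],
\qquad
\Delta^{K-1} = \Bigl\{ w \in \mathbb{R}^{K} : w_k \ge 0,\ \sum_{k=1}^{K} w_k = 1 \Bigr\}.
\end{aligned}
\]
```

Static methods approximately solve for w once before training starts; dynamic methods update w while θ is being trained.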

Load-bearing premise

That the taxonomy dividing methods into static versus dynamic categories along with their subfamilies comprehensively organizes all existing data mixing approaches without significant omissions or overlaps.

What would settle it

A new data mixing technique that adjusts domain weights during pretraining in a manner that fits none of the four subcategories (rule-based, learning-based, adaptive, externally guided) or that cannot be reduced to the bilevel simplex formulation.

Figures

Figures reproduced from arXiv: 2604.16380 by Deyi Xiong, Supryadi, Yuxuan Miao, Zhuo Chen.

Figure 1
Figure 1: Taxonomy of data mixing. Static mixing is divided into rule-based (uniform, proportional, and temperature-based sampling) and learning-based (proxy optimization-based and prediction-based) families; dynamic mixing is divided into adaptive and externally guided families. Representative methods listed in the figure include UniMax [12], UtiliMax & MEDU [13], DoReMi [14], DoGE [15], DML [16], BiMix [17], AutoScale [18], Data Mixing Scaling Laws [19], RegMix [20], MixMin [21], MDE [22], MFMS-GP [23], ADMIRE [24], ODM [25], and ADO [26].
Figure 2
Figure 2: Proxy optimization-based methods employ a proxy model as the optimization carrier. An iterative optimization algorithm is executed on the proxy model, which uses internal signals from its training process (e.g., losses and gradients) to drive the optimization and obtain an approximately optimal data mixture. The resulting data mixture is then directly applied to training the target main model.
Figure 3
Figure 3: Prediction-based methods assume that model performance is a function f of the chosen data mixture. Once this function f has been learned, one can derive the theoretically optimal data mixture. Such methods are typically instantiated in two ways: explicit prediction and implicit prediction. Explicit prediction methods model f as an explicit function and fit it using observations collected from a set of proxy …
Figure 4
Figure 4: Dynamic methods adjust the data mixture on the fly during training of the target main model. Depending on the signal used to drive these updates, they can be divided into two categories. Adaptive methods use internal signals from the target main model (e.g., losses and gradients) as the driving signal, whereas externally guided methods use an external controller to process training-produced signals and out…
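
As a concrete, purely illustrative reading of the prediction-based recipe in Figure 3, the sketch below fits a cheap surrogate f from mixture weights to proxy-run validation loss and then searches the simplex for the mixture the surrogate predicts to be best. The data, the linear surrogate, and all numbers are assumptions for illustration; methods such as RegMix use stronger regressors and real proxy runs.

```python
# Toy sketch (our assumptions, in the spirit of the prediction-based family
# described in Figure 3; not the paper's or RegMix's actual implementation).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observations: each row is a mixture over K = 3 domains (on the
# simplex), y is the validation loss of a small proxy model trained with it.
X = rng.dirichlet(np.ones(3), size=32)
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(32)  # fake losses

# Fit a simple linear surrogate f(w) ~ loss by least squares (a linear fit
# keeps this sketch dependency-free).
A = np.hstack([X, np.ones((32, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Search candidate mixtures on the simplex and keep the one f predicts is best.
candidates = rng.dirichlet(np.ones(3), size=10_000)
pred = np.hstack([candidates, np.ones((len(candidates), 1))]) @ coef
print("predicted-best mixture:", np.round(candidates[np.argmin(pred)], 3))
```
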
read the original abstract

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limited budgets more effectively. In recent years, a growing body of work has proposed principled data mixing methods for LLM pretraining; however, the literature remains fragmented and lacks a dedicated, systematic survey. This paper provides a comprehensive review of data mixing for LLM pretraining. We first formalize data mixture optimization as a bilevel problem on the probability simplex and clarify the role of data mixing in the pretraining pipeline, and briefly explain how existing methods make this formulation tractable in practice. We then introduce a fine-grained taxonomy that organizes existing methods along two dimensions: static versus dynamic mixing. Static mixing is further categorized into rule-based and learning-based methods, while dynamic mixing is grouped into adaptive and externally guided families. For each class, we summarize representative approaches and analyze their strengths and limitations from a performance-cost trade-off perspective. Building on this analysis, we highlight challenges that cut across methods, including limited transferability across data domains, optimization objectives, models, and validation sets, as well as unstandardized evaluation protocols and benchmarks, and the inherent tension between performance gains and cost control in learning-based methods. Finally, we outline several exploratory directions, including finer-grained domain partitioning and inverse data mixing, as well as pipeline-aware designs, aiming to provide conceptual and methodological insights for future research.
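
To complement the static examples above, here is a minimal sketch of the adaptive dynamic pattern the abstract describes: keep the domain weights on the probability simplex and nudge them during training using internal signals such as per-domain losses. The multiplicative-weights update and all values below are our own illustrative assumptions, not a specific method from the survey.

```python
# Generic sketch (our assumptions, not a specific surveyed method): adaptive
# dynamic mixing via a multiplicative-weights update driven by per-domain loss.
import numpy as np

def update_weights(w, domain_losses, eta=0.1):
    """One adaptive step: up-weight domains with higher current loss,
    then renormalize back onto the probability simplex."""
    w = np.asarray(w, dtype=float) * np.exp(eta * np.asarray(domain_losses, dtype=float))
    return w / w.sum()

# Toy loop with fake per-domain losses standing in for the model's internal signals.
rng = np.random.default_rng(1)
w = np.full(4, 0.25)                                   # start uniform over 4 domains
for step in range(100):
    fake_losses = rng.uniform(0.5, 2.0, size=4) * np.array([1.0, 1.2, 0.8, 1.5])
    w = update_weights(w, fake_losses, eta=0.05)
print("final mixture:", np.round(w, 3))
```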

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper provides a comprehensive review of data mixing for LLM pretraining. It formalizes data mixture optimization as a bilevel problem on the probability simplex, introduces a taxonomy organizing methods into static (rule-based and learning-based) and dynamic (adaptive and externally guided) categories, summarizes representative approaches with their strengths and limitations from a performance-cost perspective, highlights cross-cutting challenges such as limited transferability and unstandardized evaluations, and outlines future directions including finer-grained domain partitioning and inverse data mixing.

Significance. This survey is significant as it addresses the fragmented literature on data mixing, which is critical for efficient LLM pretraining under data and compute budgets. The bilevel formalization offers a unifying framework, the taxonomy helps organize methods, and the analysis of challenges provides insights for future research. It gives credit to existing work by analyzing trade-offs.

major comments (1)
  1. [Taxonomy section] The claim that the static/dynamic taxonomy comprehensively organizes all existing methods is load-bearing for the survey's contribution; the manuscript should include an explicit table or appendix mapping all reviewed papers to the categories (rule-based, learning-based, adaptive, externally guided) to allow verification of coverage and to address the risk of omissions in the fine-grained partitioning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Taxonomy section] The claim that the static/dynamic taxonomy comprehensively organizes all existing methods is load-bearing for the survey's contribution; the manuscript should include an explicit table or appendix mapping all reviewed papers to the categories (rule-based, learning-based, adaptive, externally guided) to allow verification of coverage and to address the risk of omissions in the fine-grained partitioning.

    Authors: We agree that an explicit mapping table would strengthen verifiability of the taxonomy's coverage. In the revised version we will add an appendix table that enumerates every paper reviewed in the survey and assigns it to one of the four leaf categories (rule-based, learning-based, adaptive, externally guided). This table will also note the primary data domains and optimization objectives used in each work, allowing readers to check for omissions and to assess the fine-grained partitioning. revision: yes

Circularity Check

0 steps flagged

No significant circularity: standard survey formalization and taxonomy

full rationale

This is a survey paper whose core contribution is an organizational taxonomy and a standard bilevel formalization of data mixing as optimization over the probability simplex. The bilevel framing is presented as a clarification of data mixing's role in the pretraining pipeline rather than a derivation from the paper's own fitted results or self-citations. The static/dynamic taxonomy partitions methods published by others, without making uniqueness or overlap claims that reduce to the authors' own inputs, and the cross-cutting challenges are summarized directly from the performance-cost analyses reported in the reviewed literature. No equations, predictions, or uniqueness theorems are shown to collapse by construction to the paper's own definitions or prior self-citations; all summarized approaches are attributed to independent prior work. The structure is self-contained as a review with no load-bearing internal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey paper, the central claim rests on the completeness of the literature review, the validity of the bilevel formalization as a unifying lens, and the assumption that the taxonomy captures the key distinctions in the field.

axioms (1)
  • domain assumption The literature on data mixing for LLM pretraining is fragmented and lacks a dedicated systematic survey.
    Stated directly in the abstract as the motivation for the work.

pith-pipeline@v0.9.0 · 5587 in / 1295 out tokens · 41102 ms · 2026-05-15T00:54:00.377774+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 8 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McC...

  2. [2]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

  3. [3]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    X. Bi, Y. Zang, Y. Wang, Y. Mao, Y. Wang, Y. Guo, H. Liu, et al., “Deepseek llm: Scaling open-source language models with longtermism,” arXiv preprint arXiv:2401.02954 , 2024. [Online]. Available: https:// arxiv.org/abs/2401.02954

  4. [5]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://arxiv.org/abs/2303.08774

  5. [6]

    FuxiTranyu: A multilingual large language model trained with balanced data,

    H. Sun, R. Jin, S. Xu, L. Pan, Supryadi, M. Cui, J. Du, Y. Lei, L. Yang, L. Shi, J. Xiao, S. Zhu, and D. Xiong, “FuxiTranyu: A multilingual large language model trained with balanced data,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , F. Dernoncourt, D. Preoţiuc-Pietro, and A. Shimorina, Eds. ...

  6. [7]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems , I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. A...

  7. [8]

    Datacomp-lm: In search of the next generation of training sets for language models,

    J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Y. Gadre, H. Bansal, E. K. Guha, S. Keh, K. Arora, S. Garg, R. Xin, N. Muennighoff, R. Heckel, J. Mercat, M. Chen, S. Gururangan, M. Wortsman, A. Albalak, Y. Bitton, M. Nezhurina, A. Abbas, C.-Y. Hsieh, D. Ghosh, J. Gardner, M. Kilian, H. Zhang, R. Shao, S. Pratt, G. Daras, K. Marathe, A. Gokaslan, J. Zha...

  8. [9]

    Data, data everywhere: A guide for pretraining dataset construction,

    J. Parmar, S. Prabhumoye, J. Jennings, B. Liu, A. Jhunjhunwala, Z. Wang, M. Patwary, M. Shoeybi, and B. Catanzaro, “Data, data everywhere: A guide for pretraining dataset construction,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association...

  9. [10]

    A survey on data selection for language models,

    A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, C. Raffel, S. Chang, T. Hashimoto, and W. Y. Wang, “A survey on data selection for language models,” Transactions on Machine Learning Research , July 2024, tMLR (certified Survey Featured). [Online]. Available: https://openreview.net/forum?id=XfHWcNTSHp

  10. [11]

    A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity,

    S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts, B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno, and D. Ippolito, “A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human L...

  11. [12]

    GLaM: Efficient scaling of language models with mixture-of-experts,

    N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. Le, Y. Wu, Z. Chen, and C. Cui, “GLaM: Efficient scaling of language models with mixture-of-experts,” in Proceedi...

  12. [14]

    UniMax: Fairer and more effective language sampling for large-scale multilingual pretraining,

    [Online]. Available: https://arxiv.org/abs/2304.09151

  13. [15]

    Optimizing pretraining data mixtures with llm-estimated utility,

    W. Held, B. Paranjape, P. S. Koura, M. Lewis, F. Zhang, and T. Mihaylov, “Optimizing pretraining data mixtures with llm-estimated utility,” arXiv preprint arXiv:2501.11747 , 2025. [Online]. Available: https:// arxiv.org/abs/2501.11747

  14. [16]

    Doremi: Optimizing data mixtures speeds up language model pretraining,

    S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. Liang, Q. V. Le, T. Ma, and A. W. Yu, “Doremi: Optimizing data mixtures speeds up language model pretraining,” in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://papers.nips.cc/paper_files/paper/2023/hash/ dcba6be91359358c2355cd920da3fcbd-Abstract-Conference.html

  15. [17]

    DOGE: Domain reweighting with generalization estimation,

    S. Fan, M. Pagliardini, and M. Jaggi, “DOGE: Domain reweighting with generalization estimation,” in Proceedings of the 41 st International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, Eds., vol. 235. PMLR, 21–27 Jul 2024, pp. 12...

  16. [18]

    Data mixing laws: Optimizing data mixtures by predicting language modeling performance,

    J. Ye, P. Liu, T. Sun, J. Zhan, Y. Zhou, and X. Qiu, “Data mixing laws: Optimizing data mixtures by predicting language modeling performance,” in The Thirteenth International Conference on Learning Representations (ICLR 2025), 2025. [Online]. Available: https://openreview.net/forum?id=jjCB27TMK3

  17. [19]

    Bimix: A bivariate data mixing law for language model pretraining

    C. Ge, Z. Ma, D. Chen, Y. Li, and B. Ding, “Bimix: Bivariate data mixing law for language model pretraining,” arXiv preprint arXiv:2405.14908, 2024. [Online]. Available: https://arxiv.org/abs/2405.14908

  18. [20]

    Autoscale: Scale-aware data mixing for pre-training llms,

    F. Kang, Y. Sun, B. Wen, S. Chen, D. Song, R. Mahmood, and R. Jia, “Autoscale: Scale-aware data mixing for pre-training llms,” in Conference on Language Modeling (COLM 2025), 2025. [Online]. Available: https://openreview.net/forum?id=rujwIvjooA

  19. [21]

    Scaling laws for optimal data mixtures,

    M. Shukor, L. Bethune, D. Busbridge, D. Grangier, E. Fini, A. El-Nouby, and P. Ablin, “Scaling laws for optimal data mixtures,” arXiv preprint arXiv:2507.09404 , 2025. [Online]. Available: https://arxiv.org/ abs/2507.09404

  20. [22]

    Regmix: Data mixture as regression for language model pre-training,

    Q. Liu, X. Zheng, N. Muennighoff, G. Zeng, L. Dou, T. Pang, J. Jiang, and M. Lin, “Regmix: Data mixture as regression for language model pre-training,” in International Conference on Learning Representations (ICLR 2025), 2025. [Online]. Available: https://proceedings.iclr.cc/paper_files/paper/2025/ hash/ 5f67d864aae6115374fed7beddd119e0-Abstract-Conference.html

  21. [23]

    Mixmin: Finding data mixtures via convex minimization,

    A. Thudi, E. Rovers, Y. Ruan, T. Thrush, and C. J. Maddison, “Mixmin: Finding data mixtures via convex minimization,” in International Conference on Machine Learning (ICML 2025), Poster , 2025. [Online]. Available: https://openreview.net/forum?id=wpaxYGgp2n

  22. [24]

    Optimizing pre-training data mixtures with mixtures of data expert models,

    L. Belenki, A. Agarwal, T. Shi, and K. Toutanova, “Optimizing pre-training data mixtures with mixtures of data expert models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguisti...

  23. [25]

    Data mixture optimization: A multi-fidelity multi-scale bayesian framework,

    T. Yen, A. W. T. Siah, H. Chen, T. Peng, C. D. Guetta, and H. Namkoong, “Data mixture optimization: A multi-fidelity multi-scale bayesian framework,” in NeurIPS 2025, Poster, 2025. [Online]. Available: https:// openreview.net/forum?id=Kvsa8ZXd0W

  24. [27]

    Available: https://arxiv.org/abs/2508.11551

    [Online]. Available: https://arxiv.org/abs/2508.11551

  25. [28]

    Efficient online data mixing for language model pre-training,

    A. Albalak, L. Pan, C. Raffel, and W. Y. Wang, “Efficient online data mixing for language model pre- training,” arXiv preprint arXiv:2312.02406, 2023. [Online]. Available: https://arxiv.org/abs/2312.02406

  26. [29]

    Adaptive data optimization: Dynamic sample selection with scaling laws,

    Y. Jiang, A. Zhou, Z. Feng, S. Malladi, and J. Z. Kolter, “Adaptive data optimization: Dynamic sample selection with scaling laws,” in The Thirteenth International Conference on Learning Representations , 2025, iCLR 2025 Poster. [Online]. Available: https://iclr.cc/virtual/2025/poster/29145

  27. [30]

    Velocitune: A velocity-based dynamic domain reweighting method for continual pre-training,

    Z. Luo, X. Zhang, X. Liu, H. Li, Y. Gong, Q. Chen, and P. Cheng, “Velocitune: A velocity-based dynamic domain reweighting method for continual pre-training,” in Proceedings of the 63 rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 16644–...

  28. [31]

    Pike: Adaptive data mixing for multitask learning under low gradient conflicts,

    Z. Li, Y. Deng, P. Zhong, M. Razaviyayn, and V. Mirrokni, “Pike: Adaptive data mixing for multitask learning under low gradient conflicts,” arXiv preprint arXiv:2502.06244, 2025. [Online]. Available: https:// arxiv.org/abs/2502.06244

  29. [32]

    Grape: Optimize data mixture for group robust multi-target adaptive pretraining,

    S. Fan, M. I. Glarou, and M. Jaggi, “Grape: Optimize data mixture for group robust multi-target adaptive pretraining,” in Thirty-ninth Conference on Neural Information Processing Systems , 2025, neurIPS 2025 Poster. [Online]. Available: https://openreview.net/forum?id=JRmIvBcnWc

  30. [33]

    Actor-critic based online data mixing for language model pre-training,

    J. Ma, C. Dang, and M. Liao, “Actor-critic based online data mixing for language model pre-training,” arXiv preprint arXiv:2505.23878, 2025. [Online]. Available: https://arxiv.org/abs/2505.23878

  31. [34]

    Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

    K. Yang, X. Liu, L. Ji, H. Li, Y. Gong, P. Cheng, and M. Yang, “Data mixing agent: Learning to re-weight domains for continual pre-training,” arXiv preprint arXiv:2507.15640 , 2025. [Online]. Available: https:// arxiv.org/abs/2507.15640

  32. [35]

    Tikmix: Take data influence into dynamic mixture for language model pre-training,

    Y. Wang, B. Liu, F. Liu, Y. Guo, J. Deng, X. Wu, W. Zhou, X. Zhou, and T. Wang, “Tikmix: Take data influence into dynamic mixture for language model pre-training,” arXiv preprint arXiv:2508.17677, 2025. [Online]. Available: https://arxiv.org/abs/2508.17677

  33. [36]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P.-S. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Else...

  34. [37]

    OLMo: Accelerating the science of language models,

    D. Groeneveld, I. Beltagy, E. P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. H. Saunders...

  35. [38]

    Bert: Pre-training of deep bidirectional transformers for language understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 201...

  36. [39]

    Veco: Variable and flexible cross-lingual pre-training for language understanding and generation,

    F. Luo, W. Li, Y. Liu, B. Huang, Z. Wang, J. Guo, X.-L. Sun, and W.-Y. Liu, “Veco: Variable and flexible cross-lingual pre-training for language understanding and generation,” in Proceedings of the 59 th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, Aug. 2021, pp. 3980–3993. [Online]. ...

  37. [40]

    Unsupervised cross-lingual representation learning at scale,

    A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58 th Annual Meeting of the Association for Computational Linguistics . Online: Association for Computational Linguistics, Jul. 2020, pp. 8440–8451....

  38. [41]

    Xlm-e: Cross-lingual language model pre-training via ELECTRA,

    Z. Chi, L. Dong, F. Wei, W. Wang, N. Yang, S. Singhal, S. Wang, X. Song, S. Ma, S. Huang, M. Zhou, and F. Wei, “Xlm-e: Cross-lingual language model pre-training via ELECTRA,” in Proceedings of the 60 th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2...

  39. [42]

    Multilingual denoising pre-training for neural machine translation,

    Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.47/

  40. [43]

    Training sparse mixture of experts text embedding models,

    Z. Nussbaum and B. Duderstadt, “Training sparse mixture of experts text embedding models,” arXiv preprint arXiv:2502.07972, 2025. [Online]. Available: https://arxiv.org/abs/2502.07972

  41. [44]

    mt5: A massively multilingual pre-trained text-to-text transformer,

    L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . Online: Association for Computational Linguisti...

  42. [45]

    mmbert: A modern multilingual encoder with annealed language learning,

    M. Marone, O. Weller, W. Fleshman, E. Yang, D. Lawrie, and B. Van Durme, “mmbert: A modern multilingual encoder with annealed language learning,” arXiv preprint arXiv:2509.06888, 2025. [Online]. Available: https://arxiv.org/abs/2509.06888

  43. [46]

    Portfolio selection,

    H. Markowitz, “Portfolio selection,” The Journal of Finance , vol. 7, no. 1, pp. 77–91, March 1952. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.1952.tb01525.x

  44. [47]

    Markowitz portfolio construction at seventy,

    S. Boyd, K. Johansson, R. Kahn, P. Schiele, and T. Schmelzer, “Markowitz portfolio construction at seventy,” The Journal of Portfolio Management, vol. 50, no. 8, pp. 117–160, 2024, special Issue Dedicated to Harry Markowitz. [Online]. Available: https://www.pm-research.com/content/iijpormgmt/50/8/117

  45. [48]

    Robust stochastic approximation approach to stochastic programming,

    A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization , vol. 19, no. 4, pp. 1574–1609, 2009. [Online]. Available: https://epubs.siam.org/doi/10.1137/070704277

  46. [49]

    Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization,

    S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang, “Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization,” in Proceedings of the International Conference on Learning Representations (ICLR) , 2020. [Online]. Available: https:// openreview.net/forum?id=ryxGuJrFvS

  47. [50]

    Distributionally robust language modeling,

    Y. Oren, S. Sagawa, T. B. Hashimoto, and P. Liang, “Distributionally robust language modeling,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9 th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) . Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 422...

  48. [51]

    Prioritized training on points that are learnable, worth learning, and not yet learnt,

    S. Mindermann, A. Bengs, J. Hooper, Y. Gal, and A. Weller, “Prioritized training on points that are learnable, worth learning, and not yet learnt,” in Proceedings of the 39 th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162....

  49. [52]

    An empirical analysis of compute-optimal large language model training,

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. Rae, O. Vinyals, and L. Sifre, “An empirical analysis of compute-optimal large language model training,” in Adv...

  50. [53]

    Lightgbm: A highly efficient gradient boosting decision tree,

    G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017, pp. 3146–3154. [Online]. Available: https://papers.nips.cc/paper_files/paper/2017/hash/ 6449f44a102fde848669bdd9eb6b76fa-Abstract.html

  51. [54]

    C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006. [Online]. Available: https://direct.mit.edu/books/oa-monograph/2320/Gaussian-Processes-for-Machine-Learning

  52. [55]

    Multivariable functional interpolation and adaptive networks,

    D. S. Broomhead and D. Lowe, “Multivariable functional interpolation and adaptive networks,” Complex Systems, vol. 2, no. 3, pp. 321–355, 1988. [Online]. Available: https://www.complex-systems.com/abstracts/v02_i03_a05/

  53. [56]

    Practical multi-fidelity bayesian optimization for hyperparameter tuning,

    J. Wu, S. Toscano-Palmerin, P. I. Frazier, and A. G. Wilson, “Practical multi-fidelity bayesian optimization for hyperparameter tuning,” in Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, ser. Proceedings of Machine Learning Research, R. P. Adams and V. Gogate, Eds., vol. 115. PMLR, 22–25 Jul 2020, pp. 788–798. [Online]. Availab...

  54. [57]

    The nonstochastic multiarmed bandit problem,

    P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002. [Online]. Available: https: //epubs.siam.org/ doi/10.1137/S0097539701398375

  55. [58]

    Neuronlike adaptive elements that can solve difficult learning control problems,

    A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics , vol. SMC-13, no. 5, pp. 834–846, 1983. [Online]. Available: https://incompleteideas.net/papers/barto-sutton-anderson-83.pdf

  56. [59]

    Conservative q-learning for offline reinforcement learning,

    A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,” in Advances in Neural Information Processing Systems , vol. 33, 2020. [Online]. Available: https:// proceedings.neurips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html

  57. [60]

    Aioli: A unified optimization framework for language model data mixing,

    M. F. Chen, M. Y. Hu, N. Lourie, K. Cho, and C. Ré, “Aioli: A unified optimization framework for language model data mixing,” in International Conference on Learning Representations (ICLR) , 2025. [Online]. Available: https://openreview.net/forum?id=sZGZJhaNSe

  58. [61]

    Unsupervised topic models are data mixers for pre-training language models,

    J. Peng, X. Zhuang, J. Qiu, R. Ma, J. Yu, T. Bai, and C. He, “Unsupervised topic models are data mixers for pre-training language models,” arXiv preprint arXiv:2502.16802 , 2025. [Online]. Available: https://arxiv. org/abs/2502.16802

  59. [62]

    Some methods for classification and analysis of multivariate observations,

    J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, L. M. Le Cam and J. Neyman, Eds. Berkeley, CA: University of California Press, 1967, pp. 281–297. [Online]. Available: https://projecteuclid.org/ebooks...

  60. [63]

    k-means++: The advantages of careful seeding,

    D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07) . New Orleans, Louisiana, USA: Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035. [Online]. Available: https:// dl.acm.org/doi/10.5555/1283383.1283494

  61. [64]

    Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training,

    S. Diao, Y. Yang, Y. Fu, X. Dong, D. Su, M. Kliegl, Z. Chen, P. Belcak, Y. Suhara, H. Yin, M. Patwary, Y. Lin, J. Kautz, and P. Molchanov, “Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training,” arXiv preprint arXiv:2504.13161, 2025. [Online]. Available: https://arxiv.org/ abs/2504.13161

  62. [65]

    Domain2vec: Vectorizing datasets to find the optimal data mixture without training,

    M. Zhang, H. Tissue, L. Wang, and X. Qiu, “Domain2vec: Vectorizing datasets to find the optimal data mixture without training,” arXiv preprint arXiv:2506.10952, 2025. [Online]. Available: https://arxiv.org/ abs/2506.10952

  63. [66]

    Balanced data sampling for language model training with clustering,

    Y. Shao, L. Li, Z. Fei, H. Yan, D. Lin, and X. Qiu, “Balanced data sampling for language model training with clustering,” in Findings of the Association for Computational Linguistics: ACL 2024 , L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 14 012–14 023. [Online]. Available: https...

  64. [67]

    Dynamic gradient alignment for online data mixing,

    S. Fan, D. Grangier, and P. Ablin, “Dynamic gradient alignment for online data mixing,” OpenReview preprint, 2025, iCLR 2025 submission. [Online]. Available: https://openreview.net/forum?id=O3SatrdL97

  65. [68]

    Dids: Domain impact-aware data sampling for large language model training,

    W. Shi, J. Zhang, Y. Wu, J. Fang, S. Zhang, Y. Zhao, H. Chen, R. Zhang, Y. Cui, J. Zhu, S. Han, J. Xu, and X. Zhou, “Dids: Domain impact-aware data sampling for large language model training,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. ...

  66. [69]

    Toremi: Topic-aware data reweighting for dynamic pre-training data selection,

    X. Zhu, Z. Gu, S. Zheng, T. Wang, T. Li, H. Feng, and Y. Xiao, “Toremi: Topic-aware data reweighting for dynamic pre-training data selection,” arXiv preprint arXiv:2504.00695, 2025. [Online]. Available: https:// arxiv.org/abs/2504.00695

  67. [70]

    R&b: Domain regrouping and data mixture balancing for efficient foundation model training,

    A. Ge, T.-H. Huang, J. Cooper, A. Trost, Z. Chu, S. S. S. Namburi GNVV, Z. Cai, K. Park, N. Roberts, and F. Sala, “R&b: Domain regrouping and data mixture balancing for efficient foundation model training,” arXiv preprint arXiv:2505.00358, 2025. [Online]. Available: https://arxiv.org/abs/2505.00358

  68. [71]

    Skill-it! a data-driven skills framework for understanding and training language models,

    M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang, F. Sala, and C. Ré, “Skill-it! a data-driven skills framework for understanding and training language models,” in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://papers.nips.cc/paper_files/paper/2023/hash/70b8505ac 79e3e131756f793cd80eb8d-Abstract-Conference.html

  69. [72]

    Ideal: Data equilibrium adaptation for multi-capability language model alignment,

    C. Ming, C. Qu, M. Cai, Q. Pei, Z. Pan, Y. Li, X. Duan, L. Wu, and C. He, “Ideal: Data equilibrium adaptation for multi-capability language model alignment,” arXiv preprint arXiv:2505.12762, 2025. [Online]. Available: https://arxiv.org/abs/2505.12762

  70. [73]

    AutoMixAlign: Adaptive data mixing for multi-task preference optimization in LLMs,

    N. E. Corrado, J. Katz-Samuels, A. M. Devraj, H. Yun, C. Zhang, Y. Xu, Y. Pan, B. Yin, and T. Chilimbi, “AutoMixAlign: Adaptive data mixing for multi-task preference optimization in LLMs,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, ...

  71. [74]

    Chameleon: A flexible data-mixing framework for language model pretraining and finetuning,

    W. Xie, F. Tonin, and V. Cevher, “Chameleon: A flexible data-mixing framework for language model pretraining and finetuning,” arXiv preprint arXiv:2505.24844 , 2025. [Online]. Available: https://arxiv.org/ abs/2505.24844

  72. [75]

    Sharp analysis of low-rank kernel matrix approximations,

    F. Bach, “Sharp analysis of low-rank kernel matrix approximations,” in Proceedings of the 26 th Annual Conference on Learning Theory , ser. Proceedings of Machine Learning Research, S. Shalev-Shwartz and I. Steinwart, Eds., vol. 30. Princeton, NJ, USA: PMLR, 12–14 Jun 2013, pp. 185–209. [Online]. Available: https://proceedings.mlr.press/v30/Bach13.html

  73. [76]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: https://arxiv.org/abs/2001.08361

  74. [77]

    SampleMix: A sample-wise pre-training data mixing strategy by coordinating data quality and diversity,

    X. Xi, D. Kong, J. Yang, J. Yang, Z. Chen, W. Wang, J. Wang, X. Cai, S. Zhang, and W. Ye, “SampleMix: A sample-wise pre-training data mixing strategy by coordinating data quality and diversity,” in Findings of the Association for Computational Linguistics: EMNLP 2025 , C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Associ...

  75. [78]

    DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

    Z. Chen, G. K. R. Lau, C.-S. Foo, and B. K. H. Low, “Duet: Optimizing training data mixtures via feedback from unseen evaluation tasks,” arXiv preprint arXiv:2502.00270 , 2025. [Online]. Available: https://arxiv. org/abs/2502.00270

  76. [79]

    Quadmix: Quality- diversity balanced data selection for efficient llm pretraining,

    F. Liu, W. Zhou, B. Liu, Z. Yu, Y. Zhang, H. Lin, Y. Yu, X. Zhou, T. Wang, and Y. Cao, “Quadmix: Quality- diversity balanced data selection for efficient llm pretraining,” arXiv preprint arXiv:2504.16511 , 2025. [Online]. Available: https://arxiv.org/abs/2504.16511

  77. [80]

    Data mixture inference attack: BPE tokenizers reveal training data compositions,

    J. Hayase, A. Liu, Y. Choi, S. Oh, and N. A. Smith, “Data mixture inference attack: BPE tokenizers reveal training data compositions,” in Advances in Neural Information Processing Systems 37 (NeurIPS 2024).

  78. [81]

    Available: https://proceedings.neurips.cc/paper_files/paper/2024/hash/10e6dfea9a673bef4 a7b1cb9234891bc-Abstract-Conference.html

    [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2024/hash/10e6dfea9a673bef4 a7b1cb9234891bc-Abstract-Conference.html

  79. [82]

    Neural machine translation of rare words with subword units,

    R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54 th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), K. Erk and N. A. Smith, Eds. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715–1725. [Online]. Available: htt...

  80. [83]

    Data proportion detection for optimized data management for large language models,

    H. Liang, K. Zhao, Y. Yang, B. Cui, G. Dong, Z. Zhou, and W. Zhang, “Data proportion detection for optimized data management for large language models,” arXiv preprint arXiv:2409.17527, 2024. [Online]. Available: https://arxiv.org/abs/2409.17527