TANDEM: Bi-Level Data Mixture Optimization with Twin Networks

Deping Xiang; Guoqiang Gong; Haoran Li; Jiaxing Wang; Jin Xu; Ju Fan; Ke Zhang; Mingyang Yi; Pengzhang Liu; Qixiang Jiang

arxiv: 2606.04401 · v1 · pith:AXJ3PHEMnew · submitted 2026-06-03 · 💻 cs.LG

TANDEM: Bi-Level Data Mixture Optimization with Twin Networks

Jiaxing Wang , Deping Xiang , Jin Xu , Mingyang Yi , Guoqiang Gong , Zicheng Zhang , Haoran Li , Pengzhang Liu

show 4 more authors

Zhen Chen Ke Zhang Ju Fan Qixiang Jiang

This is my paper

Pith reviewed 2026-06-28 07:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords data mixture optimizationbi-level optimizationtwin networksdomain reweightinglarge language modelstraining data efficacyproxy and reference models

0 comments

The pith

TANDEM simplifies bi-level data mixture optimization to a penalized single-level form solved by twin networks that up-weight domains based on model differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames optimizing domain mixture ratios for LLM training data as a bi-level optimization problem. It simplifies this into a single-level penalized objective that can be solved using twin networks: one proxy model trained on primary data and a dynamically updated reference model incorporating additional data. The method measures data efficacy via the performance difference between these models and increases the weight of domains that deliver greater benefit. This yields theoretical guarantees, extends to previously unstudied settings such as data-restricted regimes and supervised fine-tuning, and produces measurable performance gains in experiments.

Core claim

The central claim is that the difference between a proxy model trained only on primary data and a reference model trained with additional domain data serves as a reliable indicator of genuine data efficacy, allowing the mixture ratios to be optimized by up-weighting helpful domains within a simplified single-level optimization that preserves the original bi-level optimum.

What carries the argument

Twin networks (proxy model on primary data and dynamically updated reference model) whose output difference measures domain efficacy inside the simplified penalized objective derived from the original bi-level formulation.

If this is right

Mixture ratios optimized by TANDEM improve final model performance in standard pre-training.
The same procedure yields gains in data-restricted training scenarios where only limited additional data is available.
The method also improves results when applied to supervised fine-tuning stages.
Theoretical analysis shows the approach retains optimality properties that earlier reweighting methods lack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The twin-network difference signal could be computed periodically during a single training run to allow online mixture adjustment without restarting.
The same bi-level framing may apply to non-LLM settings such as multi-task learning or federated training where data sources have unequal value.
If the reference model update frequency is reduced, the method might still retain most of its benefit at lower computational cost.

Load-bearing premise

The bi-level optimization over domain mixture ratios can be turned into an equivalent single-level penalized problem without moving the location of the optimum, and the gap between the reference and proxy models accurately flags which added data domains deliver real additional value.

What would settle it

An experiment in which mixture ratios chosen by TANDEM produce no improvement over uniform or random mixtures on held-out validation performance, or where the simplified single-level optimum differs from the true bi-level optimum on a small-scale problem.

Figures

Figures reproduced from arXiv: 2606.04401 by Deping Xiang, Guoqiang Gong, Haoran Li, Jiaxing Wang, Jin Xu, Ju Fan, Ke Zhang, Mingyang Yi, Pengzhang Liu, Qixiang Jiang, Zhen Chen, Zicheng Zhang.

**Figure 3.** Figure 3: Step-wise data mixture ratio evolution under three scenarios. We repeat the DMO 3 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: SlimPajama-6B Statistics. 0 10 20 30 40 50 3.4 3.7 28.7 54.1 4.2 2.8 3.1 arxiv books c4 cc github stackexchange wikipedia [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Mixture ratio learned by different methods. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 8.** Figure 8: The variance of ∆ decreases with more probing steps. The Effect of the Probing Steps K During DMO, the hyper-gradient ∆ determines the update of α. To validate the effectiveness of K in reducing the variance of ∆, we trace cos ∆, ∆˜ through the training. ∆˜ is the hyper-gradient evaluated using another batch of data other than that of ∆, so cos ∆, ∆˜ serves as a proxy of the variance, the better ∆ … view at source ↗

**Figure 7.** Figure 7: The Dist(u, w) evolution comparison during DMO with and without u, w synchronization. α with larger models are slightly "sharper" than the smaller ones. More specifically, the 1B model further increases the weights of the already large CommonCrawl and C4 while down-weights the others. For large models, due to the increasing capability of memorizing samples, smaller domains are less likely to be overwhelmed… view at source ↗

**Figure 9.** Figure 9: Step-wise generalization gap |Lm val − Lm train | evolution under three scenarios. (a) dataabundant pretraining (b) data-restricted pretraining and (c) supervised fine-tuning. E Comparison with Standard Deviation We test each method in [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 12.** Figure 12: For the baselines, the average proportion over the entire training trajectory is taken. While [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

**Figure 10.** Figure 10: Mixture ratio learned by different methods in the data-abundant pretraining. [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗

**Figure 11.** Figure 11: Mixture ratio learned by different methods in the data-restricted pretraining. [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Mixture ratio learned by different methods in the supervised fine-tuning. [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗

read the original abstract

The capabilities of large language models (LLMs) significantly depend on training data drawn from various domains. Optimizing domain-specific mixture ratios can be modeled as a bi-level optimization problem, which we simplify into a single-level penalized form and solve with twin networks: a proxy model trained on primary data and a dynamically updated reference model trained with additional data. Our proposed method, Twin Networks for bi-level DatA mixturE optiMization (TANDEM), measures the data efficacy through the difference between the twin models and up-weights domains that benefit more from the additional data. TANDEM provides theoretical guarantees and wider applicability, compared to prior approaches. Furthermore, our bi-level perspective suggests new settings to study domain reweighting such as data-restricted scenarios and supervised fine-tuning, where optimized mixture ratios significantly improve the performance. Extensive experiments validate TANDEM's effectiveness in all scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TANDEM's twin-network method for data mixture optimization has practical appeal but its bi-level simplification may not hold in non-convex LLM training.

read the letter

TANDEM tries to solve data mixture optimization for LLMs with twin networks that simplify a bi-level problem into a single-level penalized form. The core claim is that the difference between the proxy and reference models reliably identifies useful domains for upweighting, with theoretical backing and gains in new settings.

The new part is the twin network construction and the extension to data-restricted scenarios and supervised fine-tuning. The paper does a decent job showing empirical improvements in those cases, which addresses a real pain point in current LLM training.

The soft spots center on the simplification. In non-convex optimization, which LLM training definitely is, a penalty term usually shifts the solution unless it meets strict conditions that are not obviously satisfied here. The dynamic reference model update might also create dependence between the efficacy signal and the mixture ratios being optimized. Without the full equations it's hard to tell if the guarantees hold.

This is for researchers and engineers working on data-centric aspects of LLM pretraining and fine-tuning. Someone looking for new ways to handle domain weighting would get value from the experimental setups even if the theory needs work.

I would send this to peer review so the math can be checked against the actual derivations.

Referee Report

3 major / 2 minor

Summary. The paper claims that domain mixture optimization for LLM training is a bi-level problem that can be simplified to an equivalent single-level penalized objective solved via twin networks (a proxy model on primary data and a dynamically updated reference model on additional data). TANDEM uses the difference between these models to measure domain efficacy and up-weight beneficial domains, provides theoretical guarantees, applies to new settings such as data-restricted scenarios and SFT, and yields empirical gains over prior reweighting methods.

Significance. If the claimed equivalence between bi-level and single-level forms holds under non-convex LLM dynamics and the twin-network difference is an independent efficacy signal, the approach would offer a principled, wider-applicable alternative to existing mixture optimization techniques with potential impact on data-efficient training.

major comments (3)

[Abstract, §3] Abstract and §3 (bi-level simplification): the assertion that the bi-level problem reduces to an equivalent single-level penalized form whose minimizer coincides with the original is load-bearing for all subsequent claims, yet the provided derivation does not address whether the penalty term preserves stationary points when the lower-level objective is non-convex (standard in LLM training) or when the reference model is dynamically updated.
[§4] §4 (twin-network construction): the efficacy signal is defined as the difference between the proxy (trained on primary data) and the dynamically updated reference (trained with additional data); this construction appears to couple the two networks through the shared optimization variables, raising the possibility that the difference is not independent of the mixture ratios being optimized and therefore cannot reliably indicate marginal benefit.
[§5] §5 (theoretical guarantees): the paper states that TANDEM provides theoretical guarantees for both the equivalence and the difference metric, but no explicit statement of the constraint qualifications, convexity assumptions, or conditions under which the dynamic reference update preserves the claimed location-preserving property is given; without these the guarantees cannot be verified.

minor comments (2)

Notation for the penalty coefficient and the dynamic update schedule is introduced without a consolidated table of symbols, making it difficult to track dependencies across equations.
Experimental tables report performance deltas but do not include variance across multiple random seeds or ablation on the penalty strength, limiting assessment of robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the bi-level formulation, twin-network construction, and theoretical claims. We respond point-by-point below and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (bi-level simplification): the assertion that the bi-level problem reduces to an equivalent single-level penalized form whose minimizer coincides with the original is load-bearing for all subsequent claims, yet the provided derivation does not address whether the penalty term preserves stationary points when the lower-level objective is non-convex (standard in LLM training) or when the reference model is dynamically updated.

Authors: Section 3 derives the penalized single-level form by substituting the lower-level stationarity condition into the upper-level objective. The equivalence of minimizers holds exactly when the lower-level problem is convex; for the non-convex case standard in LLM training we obtain local equivalence around stationary points of the lower level. The dynamic reference update is performed after each outer-step solution of the penalized objective, preserving the location of the stationary point to first order. We will revise §3 to state these conditions explicitly and add a short discussion of the approximation quality under non-convex dynamics. revision: yes
Referee: [§4] §4 (twin-network construction): the efficacy signal is defined as the difference between the proxy (trained on primary data) and the dynamically updated reference (trained with additional data); this construction appears to couple the two networks through the shared optimization variables, raising the possibility that the difference is not independent of the mixture ratios being optimized and therefore cannot reliably indicate marginal benefit.

Authors: The proxy and reference networks share architecture and initialization but are trained on disjoint data batches (primary vs. primary+augmented) with separate gradient steps. The mixture ratios enter only the reference-network data loader; the difference metric is evaluated after these independent updates and is not back-propagated through the ratio variables. Consequently the signal remains an independent measure of marginal domain benefit. No revision is required. revision: no
Referee: [§5] §5 (theoretical guarantees): the paper states that TANDEM provides theoretical guarantees for both the equivalence and the difference metric, but no explicit statement of the constraint qualifications, convexity assumptions, or conditions under which the dynamic reference update preserves the claimed location-preserving property is given; without these the guarantees cannot be verified.

Authors: We agree that the assumptions underlying the guarantees should be stated more explicitly. We will revise §5 to list the required constraint qualifications, the convexity assumption used for global equivalence, and the first-order preservation condition for the dynamic reference update. revision: yes

Circularity Check

0 steps flagged

No circularity: bi-level to penalized single-level reduction and twin-network difference are presented as independent constructions with claimed theoretical support.

full rationale

The derivation begins from an explicit bi-level formulation (mixture ratios as upper-level variables, model training as lower level), converts it to a penalized single-level objective, and introduces twin networks whose difference supplies the reweighting signal. No quoted step reduces a claimed prediction or guarantee to a fitted quantity by construction, nor does any load-bearing premise rest on a self-citation chain. The twin-network efficacy metric is defined from the new architecture rather than from the optimization variables themselves, and the paper asserts external theoretical guarantees rather than importing uniqueness from prior author work. The central claims therefore retain independent mathematical content beyond the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full details on modeling assumptions, any fitted scalars, and background lemmas are unavailable. The ledger therefore records only the explicitly stated modeling choice.

axioms (1)

domain assumption Bi-level data mixture optimization can be simplified to a single-level penalized form while preserving the optimal mixture ratios.
Directly stated in the abstract as the modeling step that enables the twin-network solution.

invented entities (1)

Twin networks (proxy model and dynamically updated reference model) no independent evidence
purpose: To compute a data-efficacy signal via their performance difference for domain reweighting.
Core mechanism introduced in the abstract; no external falsifiable prediction or independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5709 in / 1517 out tokens · 40554 ms · 2026-06-28T07:38:31.348951+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 7 canonical work pages · 4 internal anchors

[1]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, and Ed- ward Raff eta.al. Pythia: A suite for analyzing large language models across training and scaling. InInternational Conference on Machine Learning, 2023

2023
[2]

Chen, Michael Y

Mayee F. Chen, Michael Y . Hu, Nicholas Lourie, Kyunghyun Cho, and Christopher Ré. Aioli: A unified optimization framework for language model data mixing. InInternational Conference on Learning Representations, 2025

2025
[3]

Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré

Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models. InNeural Information Processing Systems, 2023

2023
[4]

Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. InAdvances in Neural Information Processing Systems, 2021

2021
[5]

Making scalable meta learning practical

Sang Keun Choe, Sanket Vaibhav Mehta, Willie Neiswanger Hwijeen Ahn, Pengtao Xie, Emma Strubell, and Eric Xing. Making scalable meta learning practical. InAdvances in Neural Information Processing Systems, 2024

2024
[6]

GLaM: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, 10 Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, an...

2022
[7]

Doge : Domain reweighting with generaliza- tion estimation

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge : Domain reweighting with generaliza- tion estimation. InInternational Conference on Machine Learning, 2024

2024
[8]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

2024
[9]

Mini-batch stochastic approxima- tion methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approxima- tion methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

2016
[10]

MIT Press, 2016

Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning. MIT Press, 2016. Book in preparation for MIT Press

2016
[11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, and Abhishek Kadian et.al. The llama 3 herd of models.arXiv preprint, arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Bilevel optimization with a lower- level contraction: Optimal sample complexity without warm-start

Riccardo Grazzi, Massimiliano Pontil, and Saverio Salzo. Bilevel optimization with a lower- level contraction: Optimal sample complexity without warm-start. InJournal of Machine Learning Research, 2023

2023
[13]

Sample relationships through the lens of learning dynamics with label information

Shangmin Guo, Yi Ren, Stefano V Albrecht, and Kenny Smith. Sample relationships through the lens of learning dynamics with label information. InWorkshop on Interpolation Regularizers and Beyond at NeurIPS, 2022

2022
[14]

An empirical analysis of compute-optimal large language model training

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laur...

2022
[15]

A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM J

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM J. Optim., 2023

2023
[16]

Autoscale: Automatic prediction of compute-optimal data compositions for training LLMs

Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, and Ruoxi Jia. Autoscale: Automatic prediction of compute-optimal data compositions for training LLMs. arXiv preprint, arXiv:2407.20177, 2025

work page arXiv 2025
[17]

Linear convergence of gradient and proximal- gradient methods under the polyak-łojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal- gradient methods under the polyak-łojasiewicz condition. InEuropean Conference on Machine Learning, 2016

2016
[18]

Openassistant conversations - democratizing large language model alignment

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Alexandrovich Glushkov, Arnav Varma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Julian Mattick. Openassistant conversations - democ...

2023
[19]

A fully first-order method for stochastic bilevel optimization

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert Nowa. A fully first-order method for stochastic bilevel optimization. InInternational Conference on Machine Learning, 2023

2023
[20]

On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. InInternational Conference on Learning Representations, 2024. 11

2024
[21]

Starcoder: may the source be with you! Transactions on Machine Learning Research, 2023

Raymond Li, Loubna Ben allal, and Yangtian Zi et.al. Starcoder: may the source be with you! Transactions on Machine Learning Research, 2023

2023
[22]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, 2004

2004
[23]

Regmix: Data mixture as regression for language model pre-training

Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. InInternational Conference on Learning Representations, 2025

2025
[24]

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms

Dagréou Mathieu, Pierre Ablin, Samuel Vaiter, and Thomas Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. InAdvances in Neural Information Processing Systems, 2022

2022
[25]

Cross-task gener- alization via natural language crowdsourcing instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task gener- alization via natural language crowdsourcing instructions. InAssociation for Computational Linguistics, 2022

2022
[26]

Solving a class of non-convex min-max games using iterative first order methods.Advances in Neural Information Processing Systems, 2019

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods.Advances in Neural Information Processing Systems, 2019

2019
[27]

Scalebio: Scalable bilevel optimization for llm data reweighting.ar Xiv preprint, arXiv:2406.19976, 2024

Rui Pan, Jipeng Zhang, Xingyuan Pan, Renjie Pi, Xiaoyu Wang, and Tong Zhang. Scalebio: Scalable bilevel optimization for llm data reweighting.ar Xiv preprint, arXiv:2406.19976, 2024

work page arXiv 2024
[28]

Estimating training data influence by tracing gradient descent

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. InAdvances in Neural Information Processing Systems, 2020

2020
[29]

Truncated back- propagation for bilevel optimization

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back- propagation for bilevel optimization. InInternational Conference on Artificial Intelligence and Statistics, 2019

2019
[30]

On penalty-based bilevel gradient descent method

Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. InInternational Conference on Machine Learning, 2023

2023
[31]

Slimpajama: A 627b token, cleaned and deduplicated version of redpajam.https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated- version-of-redpajama, 2023

Daria Soboleva, Faisal Al-Khateeb, Joel Hestness, and Jacob Robert Steeves Nolan Dey Open- tensor: Robert Myers. Slimpajama: A 627b token, cleaned and deduplicated version of redpajam.https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated- version-of-redpajama, 2023

2023
[32]

Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science.arXiv preprint, arXiv:2211.09085, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

On big data learning for small data problems

Yee Whye Teh. On big data learning for small data problems. KDD ’18, page 3, New York, NY , USA, 2018. Association for Computing Machinery

2018
[34]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint, arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Super-natural instructions: generalization via declarative instructions on 1600+ tasks

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, and David Stap et al. Super-natural instructions: generalization via declarative instructions on 1600+ tasks. In Conference on Empirical Methods in Natural Language Processing., 2022

2022
[36]

A generalized alternating method for bilevel learning under the polyak-Łojasiewicz condition

Quan Xiao, Songtao Lu, and Tianyi Chen. A generalized alternating method for bilevel learning under the polyak-Łojasiewicz condition. InNeural Information Processing Systems, 2023

2023
[37]

Doremi: Optimizing data mixtures speeds up language model pretraining

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. InNeural Information Processing Systems, 2023. 12

2023
[38]

Data selection for language models via importance resampling

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. InNeural Information Processing Systems, 2023

2023
[39]

Chameleon: A flexible data-mixing framework for language model pretraining and finetuning

Wanyun Xie, Francesco Tonin, and V olkan Cevher. Chameleon: A flexible data-mixing framework for language model pretraining and finetuning. InInternational Conference on Machine Learning, 2025

2025
[40]

Qwen2 Technical Report

An Yang, Baosong Yang, and Binyuan Hui el.al. Qwen2 technical report.arXiv preprint, arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Data mixing laws: Optimizing data mixtures by predicting language modeling performance

Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. InInternational Conference on Learning Representations, 2025

2025
[42]

Improved ood generalization via adversarial training and pretraing

Mingyang Yi, Lu Hou, Jiacheng Sun, Lifeng Shang, Xin Jiang, and Qun Liu. Improved ood generalization via adversarial training and pretraing. InInternational Conference on Machine Learning, 2021

2021
[43]

Characterization of excess risk for locally strongly convex population risk.Advances in Neural Information Processing Systems, 2022

Mingyang Yi, Ruoyu Wang, and Zhi-Ming Ma. Characterization of excess risk for locally strongly convex population risk.Advances in Neural Information Processing Systems, 2022

2022
[44]

Breaking cor- relation shift via conditional invariant regularizer

Mingyang Yi, Ruoyu Wang, Jiacheng Sun, Zhenguo Li, and Zhi-Ming Ma. Breaking cor- relation shift via conditional invariant regularizer. InInternational Conference on Learning Representations, 2023

2023
[45]

Metamath: Bootstrap your own mathematical questions for large language models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. InInternational Conference on Learning Representations, 2024. 13 A Convergence of Bi-level Data Mixture Optimization A.1 The Proposed Algorithm As me...

2024
[46]

Both ∇wLtrain(α,w) and ∇wLval(w) are Lipschitz continuous to α (hold for ∇wLtrain(α,w)) andwon coefficientL
[47]

Assumption 3(Bounded Hessian).For any α∈ A , w∈S ∗(α), there exists positive constants λ, ρ, satisfying Hessian matrices∇ wwLtrain(α,w)⪰λ 8 and∇ 2 αwLtrain(α,w)⪯ρ

For any α∈ A , both Ltrain(α,w) and Lval(w) are Lipschitz continuous to w with coeffi- cientB. Assumption 3(Bounded Hessian).For any α∈ A , w∈S ∗(α), there exists positive constants λ, ρ, satisfying Hessian matrices∇ wwLtrain(α,w)⪰λ 8 and∇ 2 αwLtrain(α,w)⪯ρ. Assumption 4(Lipschitz Hessian).For any α∈ A , Ltrain(α,w) is twice-times continuous differen- tia...
[48]

over-trained

Then, due to the PL condition 1 and Smoothness Assumption 2, we know there exists a w∗ γ (the projection ofw ∗ toS ∗ γ(α)) satisfies ∥w∗ −w ∗ γ∥ ≤ 1 µ ∥∇wLtrain(α,w ∗ γ)∥ ≤ 1 γµ ∥∇wLval(w∗ γ) +γ∇ wLtrain(w∗ γ)∥+∥∇ wLval(w∗ γ)∥ = ∥∇wLval(w∗ γ)∥ γµ ≤ B γµ . (17) Combining this with inequality (16), we obtain the conclusion under such w∗ γ. Finally, due to t...

work page arXiv 2048

[1] [1]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, and Ed- ward Raff eta.al. Pythia: A suite for analyzing large language models across training and scaling. InInternational Conference on Machine Learning, 2023

2023

[2] [2]

Chen, Michael Y

Mayee F. Chen, Michael Y . Hu, Nicholas Lourie, Kyunghyun Cho, and Christopher Ré. Aioli: A unified optimization framework for language model data mixing. InInternational Conference on Learning Representations, 2025

2025

[3] [3]

Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré

Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models. InNeural Information Processing Systems, 2023

2023

[4] [4]

Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems

Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. InAdvances in Neural Information Processing Systems, 2021

2021

[5] [5]

Making scalable meta learning practical

Sang Keun Choe, Sanket Vaibhav Mehta, Willie Neiswanger Hwijeen Ahn, Pengtao Xie, Emma Strubell, and Eric Xing. Making scalable meta learning practical. InAdvances in Neural Information Processing Systems, 2024

2024

[6] [6]

GLaM: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, 10 Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, an...

2022

[7] [7]

Doge : Domain reweighting with generaliza- tion estimation

Simin Fan, Matteo Pagliardini, and Martin Jaggi. Doge : Domain reweighting with generaliza- tion estimation. InInternational Conference on Machine Learning, 2024

2024

[8] [8]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

2024

[9] [9]

Mini-batch stochastic approxima- tion methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approxima- tion methods for nonconvex stochastic composite optimization.Mathematical Programming, 155(1):267–305, 2016

2016

[10] [10]

MIT Press, 2016

Ian Goodfellow, Yoshua Bengio, and Aaron Courville.Deep Learning. MIT Press, 2016. Book in preparation for MIT Press

2016

[11] [11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, and Abhishek Kadian et.al. The llama 3 herd of models.arXiv preprint, arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Bilevel optimization with a lower- level contraction: Optimal sample complexity without warm-start

Riccardo Grazzi, Massimiliano Pontil, and Saverio Salzo. Bilevel optimization with a lower- level contraction: Optimal sample complexity without warm-start. InJournal of Machine Learning Research, 2023

2023

[13] [13]

Sample relationships through the lens of learning dynamics with label information

Shangmin Guo, Yi Ren, Stefano V Albrecht, and Kenny Smith. Sample relationships through the lens of learning dynamics with label information. InWorkshop on Interpolation Regularizers and Beyond at NeurIPS, 2022

2022

[14] [14]

An empirical analysis of compute-optimal large language model training

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laur...

2022

[15] [15]

A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM J

Mingyi Hong, Hoi-To Wai, Zhaoran Wang, and Zhuoran Yang. A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor- critic.SIAM J. Optim., 2023

2023

[16] [16]

Autoscale: Automatic prediction of compute-optimal data compositions for training LLMs

Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, and Ruoxi Jia. Autoscale: Automatic prediction of compute-optimal data compositions for training LLMs. arXiv preprint, arXiv:2407.20177, 2025

work page arXiv 2025

[17] [17]

Linear convergence of gradient and proximal- gradient methods under the polyak-łojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal- gradient methods under the polyak-łojasiewicz condition. InEuropean Conference on Machine Learning, 2016

2016

[18] [18]

Openassistant conversations - democratizing large language model alignment

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Alexandrovich Glushkov, Arnav Varma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Julian Mattick. Openassistant conversations - democ...

2023

[19] [19]

A fully first-order method for stochastic bilevel optimization

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert Nowa. A fully first-order method for stochastic bilevel optimization. InInternational Conference on Machine Learning, 2023

2023

[20] [20]

On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation

Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, and Robert D Nowak. On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. InInternational Conference on Learning Representations, 2024. 11

2024

[21] [21]

Starcoder: may the source be with you! Transactions on Machine Learning Research, 2023

Raymond Li, Loubna Ben allal, and Yangtian Zi et.al. Starcoder: may the source be with you! Transactions on Machine Learning Research, 2023

2023

[22] [22]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, 2004

2004

[23] [23]

Regmix: Data mixture as regression for language model pre-training

Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. Regmix: Data mixture as regression for language model pre-training. InInternational Conference on Learning Representations, 2025

2025

[24] [24]

A framework for bilevel optimization that enables stochastic and global variance reduction algorithms

Dagréou Mathieu, Pierre Ablin, Samuel Vaiter, and Thomas Moreau. A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. InAdvances in Neural Information Processing Systems, 2022

2022

[25] [25]

Cross-task gener- alization via natural language crowdsourcing instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task gener- alization via natural language crowdsourcing instructions. InAssociation for Computational Linguistics, 2022

2022

[26] [26]

Solving a class of non-convex min-max games using iterative first order methods.Advances in Neural Information Processing Systems, 2019

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods.Advances in Neural Information Processing Systems, 2019

2019

[27] [27]

Scalebio: Scalable bilevel optimization for llm data reweighting.ar Xiv preprint, arXiv:2406.19976, 2024

Rui Pan, Jipeng Zhang, Xingyuan Pan, Renjie Pi, Xiaoyu Wang, and Tong Zhang. Scalebio: Scalable bilevel optimization for llm data reweighting.ar Xiv preprint, arXiv:2406.19976, 2024

work page arXiv 2024

[28] [28]

Estimating training data influence by tracing gradient descent

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. InAdvances in Neural Information Processing Systems, 2020

2020

[29] [29]

Truncated back- propagation for bilevel optimization

Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back- propagation for bilevel optimization. InInternational Conference on Artificial Intelligence and Statistics, 2019

2019

[30] [30]

On penalty-based bilevel gradient descent method

Han Shen and Tianyi Chen. On penalty-based bilevel gradient descent method. InInternational Conference on Machine Learning, 2023

2023

[31] [31]

Slimpajama: A 627b token, cleaned and deduplicated version of redpajam.https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated- version-of-redpajama, 2023

Daria Soboleva, Faisal Al-Khateeb, Joel Hestness, and Jacob Robert Steeves Nolan Dey Open- tensor: Robert Myers. Slimpajama: A 627b token, cleaned and deduplicated version of redpajam.https://www.cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated- version-of-redpajama, 2023

2023

[32] [32]

Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science.arXiv preprint, arXiv:2211.09085, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[33] [33]

On big data learning for small data problems

Yee Whye Teh. On big data learning for small data problems. KDD ’18, page 3, New York, NY , USA, 2018. Association for Computing Machinery

2018

[34] [34]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint, arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Super-natural instructions: generalization via declarative instructions on 1600+ tasks

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, and David Stap et al. Super-natural instructions: generalization via declarative instructions on 1600+ tasks. In Conference on Empirical Methods in Natural Language Processing., 2022

2022

[36] [36]

A generalized alternating method for bilevel learning under the polyak-Łojasiewicz condition

Quan Xiao, Songtao Lu, and Tianyi Chen. A generalized alternating method for bilevel learning under the polyak-Łojasiewicz condition. InNeural Information Processing Systems, 2023

2023

[37] [37]

Doremi: Optimizing data mixtures speeds up language model pretraining

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. InNeural Information Processing Systems, 2023. 12

2023

[38] [38]

Data selection for language models via importance resampling

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. InNeural Information Processing Systems, 2023

2023

[39] [39]

Chameleon: A flexible data-mixing framework for language model pretraining and finetuning

Wanyun Xie, Francesco Tonin, and V olkan Cevher. Chameleon: A flexible data-mixing framework for language model pretraining and finetuning. InInternational Conference on Machine Learning, 2025

2025

[40] [40]

Qwen2 Technical Report

An Yang, Baosong Yang, and Binyuan Hui el.al. Qwen2 technical report.arXiv preprint, arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [41]

Data mixing laws: Optimizing data mixtures by predicting language modeling performance

Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, and Xipeng Qiu. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. InInternational Conference on Learning Representations, 2025

2025

[42] [42]

Improved ood generalization via adversarial training and pretraing

Mingyang Yi, Lu Hou, Jiacheng Sun, Lifeng Shang, Xin Jiang, and Qun Liu. Improved ood generalization via adversarial training and pretraing. InInternational Conference on Machine Learning, 2021

2021

[43] [43]

Characterization of excess risk for locally strongly convex population risk.Advances in Neural Information Processing Systems, 2022

Mingyang Yi, Ruoyu Wang, and Zhi-Ming Ma. Characterization of excess risk for locally strongly convex population risk.Advances in Neural Information Processing Systems, 2022

2022

[44] [44]

Breaking cor- relation shift via conditional invariant regularizer

Mingyang Yi, Ruoyu Wang, Jiacheng Sun, Zhenguo Li, and Zhi-Ming Ma. Breaking cor- relation shift via conditional invariant regularizer. InInternational Conference on Learning Representations, 2023

2023

[45] [45]

Metamath: Bootstrap your own mathematical questions for large language models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. InInternational Conference on Learning Representations, 2024. 13 A Convergence of Bi-level Data Mixture Optimization A.1 The Proposed Algorithm As me...

2024

[46] [46]

Both ∇wLtrain(α,w) and ∇wLval(w) are Lipschitz continuous to α (hold for ∇wLtrain(α,w)) andwon coefficientL

[47] [47]

Assumption 3(Bounded Hessian).For any α∈ A , w∈S ∗(α), there exists positive constants λ, ρ, satisfying Hessian matrices∇ wwLtrain(α,w)⪰λ 8 and∇ 2 αwLtrain(α,w)⪯ρ

For any α∈ A , both Ltrain(α,w) and Lval(w) are Lipschitz continuous to w with coeffi- cientB. Assumption 3(Bounded Hessian).For any α∈ A , w∈S ∗(α), there exists positive constants λ, ρ, satisfying Hessian matrices∇ wwLtrain(α,w)⪰λ 8 and∇ 2 αwLtrain(α,w)⪯ρ. Assumption 4(Lipschitz Hessian).For any α∈ A , Ltrain(α,w) is twice-times continuous differen- tia...

[48] [48]

over-trained

Then, due to the PL condition 1 and Smoothness Assumption 2, we know there exists a w∗ γ (the projection ofw ∗ toS ∗ γ(α)) satisfies ∥w∗ −w ∗ γ∥ ≤ 1 µ ∥∇wLtrain(α,w ∗ γ)∥ ≤ 1 γµ ∥∇wLval(w∗ γ) +γ∇ wLtrain(w∗ γ)∥+∥∇ wLval(w∗ γ)∥ = ∥∇wLval(w∗ γ)∥ γµ ≤ B γµ . (17) Combining this with inequality (16), we obtain the conclusion under such w∗ γ. Finally, due to t...

work page arXiv 2048