Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
Pith reviewed 2026-05-21 13:37 UTC · model grok-4.3
The pith
Merging models trained on separate data sources predicts optimal mixtures for LLM pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. The reported experiments show it obtaining the optimal mixture, with higher benchmark performance than prior methods and at lower search cost.
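The decoupling in the core claim can be pictured as a plain random search in which each trial costs one proxy evaluation and no training. The following is a minimal sketch under stated assumptions, not the paper's search procedure: `search_mixture`, `toy_proxy_score`, and `TARGET` are hypothetical names, uniform Dirichlet sampling is just one possible sampler, and the toy score stands in for "merge the component models at ratio `w`, then run benchmarks".

```python
import numpy as np

def search_mixture(proxy_score, n_sources, n_trials=2000, seed=0):
    """Random search over mixture weights; every trial is a proxy evaluation only."""
    rng = np.random.default_rng(seed)
    best_w, best_s = None, -np.inf
    for _ in range(n_trials):
        # Sample a point uniformly from the probability simplex.
        w = rng.dirichlet(np.ones(n_sources))
        s = proxy_score(w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Hypothetical stand-in for the merged-proxy benchmark score (assumption):
# the score peaks at an arbitrary made-up mixture TARGET.
TARGET = np.array([0.5, 0.3, 0.2])

def toy_proxy_score(w):
    return -float(np.sum((w - TARGET) ** 2))

best_weights, best_score = search_mixture(toy_proxy_score, n_sources=3)
print(best_weights, best_score)
```

Because the loop never trains anything, raising `n_trials` only adds evaluation cost, which is the trade the paper argues for.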
What carries the argument
Weighted model merging of separately trained component models, which acts as a low-cost proxy for the performance that would result from training a single model from scratch on the corresponding data mixture.
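A weighted merge of this kind might look like the sketch below. It assumes every component model has the same architecture and exposes its parameters as a dict of NumPy arrays with identical keys and shapes; `merge_weighted`, `math_model`, and `code_model` are hypothetical names, and the arrays are toy values rather than real checkpoints. Whether the paper averages full parameters or weight deltas from a shared base is not settled by the text here, so plain averaging is used for illustration.

```python
import numpy as np

def merge_weighted(components, weights):
    """Weighted average of parameter dicts that share keys and shapes."""
    w = np.asarray(weights, dtype=float)
    if len(components) != len(w):
        raise ValueError("need exactly one weight per component model")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    w = w / w.sum()  # normalize so the weights form a mixture
    merged = {}
    for key in components[0]:
        merged[key] = sum(wi * c[key] for wi, c in zip(w, components))
    return merged

# Toy component "models" (illustrative values only).
math_model = {
    "layer.weight": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "layer.bias": np.array([0.0, 1.0]),
}
code_model = {
    "layer.weight": np.array([[3.0, 2.0], [1.0, 0.0]]),
    "layer.bias": np.array([2.0, 1.0]),
}

# Proxy for a hypothetical 75% math / 25% code data mixture.
proxy = merge_weighted([math_model, code_model], [0.75, 0.25])
print(proxy["layer.weight"])
print(proxy["layer.bias"])
```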
If this is right
- More candidate data sources can be included in the search without a proportional rise in total training cost.
- Discovered mixtures produce models that score higher on benchmarks for both general and specialized capabilities.
- The overall compute budget for data-mixture optimization can be spent on more trials rather than repeated trainings.
- A large-scale, validated pre-training corpus with the resulting mixtures becomes available for further use.
Where Pith is reading between the lines
- The same merging proxy could be tested on data-composition problems outside pre-training, such as continued training or multi-domain fine-tuning.
- If the proxy remains accurate at frontier scales, it would allow systematic inclusion of many more niche data sources than current practice permits.
- Incremental or dynamic versions of the merge step might eventually support adjusting data ratios while training is still underway.
Load-bearing premise
The performance of a weighted merge of models each trained on a single data source accurately predicts the performance of a model trained from scratch on the corresponding data mixture.
What would settle it
Train one full model from scratch on a data mixture whose performance was previously estimated by merging, then compare its benchmark scores to the merged model's scores; a large gap would show the proxy does not hold.
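With more than one mixture, the same check can be summarized by how well the proxy ranks mixtures. The sketch below computes a Spearman rank correlation between merged-proxy scores and scores of models actually trained on the same mixtures. The `spearman` helper is a hypothetical, tie-free implementation, and the score lists are made-up placeholders, not results from the paper.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for sequences without tied values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # argsort of argsort turns values into 0-based ranks (valid when no ties).
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    # Spearman correlation is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Placeholder numbers for five mixtures (NOT from the paper).
proxy_scores = [41.2, 43.5, 39.8, 45.1, 42.0]    # merged-proxy benchmark scores
trained_scores = [40.1, 44.0, 38.5, 43.2, 42.7]  # from-scratch benchmark scores

rho = spearman(proxy_scores, trained_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the proxy orders mixtures the way real training does, which is what the search needs; a low value would indicate the gap described above.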
Original abstract
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DeMix, a framework for scaling data mixture search in LLM pre-training. Component models are trained separately on individual candidate datasets; weighted merging of their parameters then serves as a proxy for the performance of a model trained from scratch on the corresponding mixture. This decouples search cost from training cost, permitting evaluation of far more candidate mixtures than direct proxy training would allow. Experiments claim that the resulting mixtures yield higher downstream benchmark scores than prior methods while incurring lower total search compute; the authors also release the 22T-token DeMix Corpora.
Significance. If the proxy assumption is shown to be reliable, DeMix would materially reduce the compute required to discover effective pre-training mixtures, addressing a practical bottleneck in scaling LLMs. The public release of a large, validated corpus further supports reproducibility and follow-on work on data-centric LLM training.
major comments (3)
- [§3 and §4] §3 (Method) and §4 (Experiments): The central claim rests on the untested assumption that a weighted parameter average of independently trained single-dataset models accurately predicts the performance of a model trained on the corresponding data mixture. No scatter plots, Pearson/Spearman correlations, or direct side-by-side comparisons between merged-proxy accuracy and actual mixture-trained accuracy are reported, leaving the justification for unlimited mixture evaluation without additional training unsupported.
- [§4.3] §4.3 (Results): The manuscript reports higher benchmark scores for DeMix-discovered mixtures but provides neither error bars on the proxy evaluations nor details on how the final mixture coefficients were selected after search. This raises the possibility that reported gains reflect post-hoc selection rather than a robust improvement over baselines.
- [§3.2] §3.2 (Merging procedure): Because component models optimize separate loss landscapes on disjoint data distributions and typically begin from independent random initializations, parameter-space interpolation need not reproduce the optimization trajectory or learned features that would arise from joint training on the mixed distribution. The paper offers no ablation or theoretical argument addressing this mismatch.
minor comments (2)
- [§3] Notation for merging coefficients is introduced without an explicit equation; adding a numbered equation would improve clarity.
- [§4] Table captions and axis labels in the experimental figures should explicitly state whether reported numbers are proxy or actual-mixture results.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below and describe the changes planned for the revised manuscript.
Point-by-point responses
- Referee: [§3 and §4] §3 (Method) and §4 (Experiments): The central claim rests on the untested assumption that a weighted parameter average of independently trained single-dataset models accurately predicts the performance of a model trained on the corresponding data mixture. No scatter plots, Pearson/Spearman correlations, or direct side-by-side comparisons between merged-proxy accuracy and actual mixture-trained accuracy are reported, leaving the justification for unlimited mixture evaluation without additional training unsupported.
  Authors: We agree that explicit validation of the proxy assumption strengthens the central claim. In the revised manuscript we will add a dedicated subsection containing scatter plots, Pearson and Spearman correlations, and direct side-by-side comparisons between merged-proxy accuracies and the performance of models trained from scratch on the same mixtures. These analyses will be performed on a representative held-out subset of mixtures at reduced scale.
  Revision: yes
- Referee: [§4.3] §4.3 (Results): The manuscript reports higher benchmark scores for DeMix-discovered mixtures but provides neither error bars on the proxy evaluations nor details on how the final mixture coefficients were selected after search. This raises the possibility that reported gains reflect post-hoc selection rather than a robust improvement over baselines.
  Authors: We will add error bars (computed over multiple random seeds) to all proxy and final-model benchmark results. We will also expand the description of the coefficient selection procedure, clarifying that a separate validation set was used to finalize the mixture weights after the search phase, thereby reducing the risk of post-hoc selection bias.
  Revision: yes
- Referee: [§3.2] §3.2 (Merging procedure): Because component models optimize separate loss landscapes on disjoint data distributions and typically begin from independent random initializations, parameter-space interpolation need not reproduce the optimization trajectory or learned features that would arise from joint training on the mixed distribution. The paper offers no ablation or theoretical argument addressing this mismatch.
  Authors: We acknowledge that parameter merging is an approximation and does not exactly replicate joint training. The revised version will include a new ablation study that directly compares merged-proxy performance against models trained from scratch on the corresponding mixtures at small scale. While a complete theoretical characterization of the approximation remains an open research question, the added empirical evidence will demonstrate the practical reliability of the proxy.
  Revision: partial
Circularity Check
No circularity: empirical proxy validated on external benchmarks
Full rationale
The paper presents DeMix as an empirical framework that trains separate component models on candidate datasets and then applies weighted parameter merging to generate performance proxies for data mixtures. No equations or derivations are provided that reduce the claimed optimal mixture or benchmark gains to a fitted quantity defined by the same inputs. The central proxy assumption (merged model performance approximates mixture-trained performance) is treated as a testable hypothesis rather than a definitional identity, and the paper reports validation against external benchmarks and released corpora. No self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps in the given text. The method is therefore self-contained against independent evaluation and receives a score of 0.
Axiom & Free-Parameter Ledger
free parameters (1)
- merging coefficients
axioms (1)
- domain assumption: Weighted averaging of parameters from models trained on disjoint data sources produces a model whose downstream performance approximates that of a model trained on the weighted data mixture.
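Written out, the ledger's single axiom and its free parameters take roughly the following form. The notation is an assumption made here for illustration and is not taken from the paper: \Theta_i denotes the parameters of the component model trained on source D_i, \alpha_i the merging coefficients, \Theta_0 a shared starting checkpoint (assumed to exist), \Theta_{\mathrm{mix}}(\alpha) the model trained from scratch on the \alpha-weighted mixture, and S(\cdot) a benchmark score.

```latex
\begin{align}
\Theta_{\mathrm{merge}}(\alpha)
  &= \sum_{i=1}^{n} \alpha_i \Theta_i,
  \qquad \alpha_i \ge 0,\quad \sum_{i=1}^{n} \alpha_i = 1, \\
  &= \Theta_0 + \sum_{i=1}^{n} \alpha_i \Delta_i,
  \qquad \Delta_i = \Theta_i - \Theta_0, \\
S\big(\Theta_{\mathrm{merge}}(\alpha)\big)
  &\approx S\big(\Theta_{\mathrm{mix}}(\alpha)\big).
\end{align}
```

The second line follows from the first because the coefficients sum to one, so averaging full parameters and averaging weight deltas around a shared base give the same merged model. Only the third line is the empirical assumption.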
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: the arithmetic sum of weight deltas from models trained on separate datasets closely approximates the weight delta obtained by training on their union (Qin et al., 2022; Wu et al., 2025; Lin et al., 2025b): Δ(D_i ∪ D_j) ≈ Δ(D_i) + Δ(D_j)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: merging the component models {Θ_i} at a specific ratio {α_i} can obtain a proxy model for any real Θ_mix trained on {α_i D_i}
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion
  A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.