Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Fangcheng Shi; Fei Zhao; Haifeng Liu; Jieying Ye; Kaiyan Zhao; Shaosheng Cao; Shengrui Li; Yao Hu; Zheyong Xie

arxiv: 2602.00747 · v2 · pith:37CAJHMWnew · submitted 2026-01-31 · 💻 cs.CL · cs.AI

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

Pith reviewed 2026-05-21 13:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: data mixing · model merging · LLM pre-training · data mixture optimization · efficient search · large language models · proxy evaluation

The pith

Merging models trained on separate data sources predicts optimal mixtures for LLM pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that trains separate models on individual candidate data sources at scale and then applies weighted merging to estimate the performance of training on any combination of those sources. This decouples the cost of searching for good data ratios from the cost of training, so many more mixtures can be evaluated without additional full-scale runs. A sympathetic reader would care because the right balance of data types such as general text, math, and code determines how capable the final model becomes, yet prior methods were either too small to trust or too expensive to explore thoroughly. The result is a way to discover mixtures that deliver higher benchmark scores while using less total training compute.

Core claim

DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Experiments show it obtains the optimal mixture with higher benchmark performance at lower search cost.
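
The search loop implied by this claim can be sketched as follows. This is a minimal illustration, not the paper's implementation: plain random search over the simplex is an assumption (the search strategy is not described in this text), and `proxy_score` is a hypothetical callable standing in for "merge the component models at these ratios, then run the benchmarks".

```python
import random
from typing import Callable, List, Sequence, Tuple


def sample_mixture(n_sources: int, rng: random.Random) -> List[float]:
    """Draw one mixture uniformly from the simplex (Dirichlet with alpha = 1)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_sources)]
    total = sum(draws)
    return [d / total for d in draws]


def search_mixtures(
    proxy_score: Callable[[Sequence[float]], float],
    n_sources: int,
    n_trials: int,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Random search: score many sampled mixtures and keep the best one."""
    rng = random.Random(seed)
    best_mix: List[float] = []
    best_score = float("-inf")
    for _ in range(n_trials):
        mix = sample_mixture(n_sources, rng)
        score = proxy_score(mix)  # no training happens inside the loop
        if score > best_score:
            best_mix, best_score = mix, score
    return best_mix, best_score


# Hypothetical stand-in for the merged-proxy evaluation: a made-up score
# that peaks at the mixture (0.5, 0.3, 0.2). Not a result from the paper.
def toy_proxy_score(mix: Sequence[float]) -> float:
    target = (0.5, 0.3, 0.2)
    return -sum((m - t) ** 2 for m, t in zip(mix, target))


best_mix, best_score = search_mixtures(toy_proxy_score, n_sources=3, n_trials=2000)
print([round(m, 2) for m in best_mix], round(best_score, 4))
```

The point of the sketch is the cost structure: the number of trials only multiplies the cost of one proxy evaluation, while the training cost is paid once for the component models.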

What carries the argument

Weighted model merging of separately trained component models, which acts as a low-cost proxy for the performance that would result from training a single model from scratch on the corresponding data mixture.
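
As a rough sketch of what weighted merging means at the parameter level, the snippet below averages toy parameter dictionaries. The `Params` representation, the function name `merge_weighted`, and the toy values are all hypothetical; real checkpoints would hold tensors, and the paper's exact merging rule may differ from a plain weighted average.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: each model maps a parameter name to a flat
# list of floats. Real checkpoints would hold tensors instead.
Params = Dict[str, List[float]]


def merge_weighted(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the per-parameter weighted average of the component models."""
    if len(models) != len(weights) or not models:
        raise ValueError("need one weight per model and at least one model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise so the weights behave like mixture ratios that sum to 1.
    ratios = [w / total for w in weights]

    merged: Params = {}
    for name, first in models[0].items():
        acc = [0.0] * len(first)
        for model, ratio in zip(models, ratios):
            values = model[name]
            if len(values) != len(first):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            for i, v in enumerate(values):
                acc[i] += ratio * v
        merged[name] = acc
    return merged


# Toy component models, e.g. one trained on general text and one on code.
general = {"layer.w": [1.0, 2.0], "layer.b": [0.0]}
code = {"layer.w": [3.0, 6.0], "layer.b": [4.0]}

proxy = merge_weighted([general, code], [3, 1])  # a 75% / 25% mixture
print(proxy)
```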

If this is right

More candidate data sources can be included in the search without a proportional rise in total training cost.
Discovered mixtures produce models that score higher on benchmarks for both general and specialized capabilities.
The overall compute budget for data-mixture optimization can be spent on more trials rather than repeated trainings.
A large-scale, validated pre-training corpus with the resulting mixtures becomes available for further use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same merging proxy could be tested on data-composition problems outside pre-training, such as continued training or multi-domain fine-tuning.
If the proxy remains accurate at frontier scales, it would allow systematic inclusion of many more niche data sources than current practice permits.
Incremental or dynamic versions of the merge step might eventually support adjusting data ratios while training is still underway.

Load-bearing premise

The performance of a weighted merge of models each trained on a single data source accurately predicts the performance of a model trained from scratch on the corresponding data mixture.

What would settle it

Train one full model from scratch on a data mixture whose performance was previously estimated by merging, then compare its benchmark scores to the merged model's scores; a large gap would show the proxy does not hold.
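
A check of this kind could look like the sketch below. The benchmark names, the scores, and the tolerance are placeholders chosen for illustration; none of them come from the paper, and what counts as "a large gap" is a judgment the text leaves open.

```python
from typing import Dict


def proxy_gaps(proxy: Dict[str, float], trained: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark difference: merged-proxy score minus trained-model score."""
    if proxy.keys() != trained.keys():
        raise ValueError("both models must be scored on the same benchmarks")
    return {name: proxy[name] - trained[name] for name in proxy}


def proxy_holds(gaps: Dict[str, float], tolerance: float) -> bool:
    """True when every absolute gap stays within the chosen tolerance."""
    return all(abs(g) <= tolerance for g in gaps.values())


# Placeholder scores for illustration only; these are not results from the paper.
proxy_scores = {"general": 0.62, "math": 0.41, "code": 0.35}
trained_scores = {"general": 0.60, "math": 0.44, "code": 0.36}

gaps = proxy_gaps(proxy_scores, trained_scores)
print({k: round(v, 3) for k, v in gaps.items()})
print(proxy_holds(gaps, tolerance=0.05))
```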

Original abstract

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora is available at https://github.com/Lucius-lsr/DeMix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DeMix trains separate component models then merges them to cheaply test data mixtures for LLM pre-training, but the proxy's accuracy is not clearly shown.

The main point is that this paper trains one model per candidate dataset at scale, then uses weighted parameter merging to stand in for training on any mixture of those datasets. That lets them try far more mixtures without paying for each one in full training runs. They also put out a 22T-token corpus with the mixtures they settled on. Experiments are said to find mixtures that beat prior ones on benchmarks while cutting search cost. Those are the concrete moves that matter here. The release of the data and code is the part most likely to see use. The merging step itself is not new, but applying it this way to decouple search from training cost is a clean framing for the data-curation problem.

The soft spot sits right at the center: the abstract gives no numbers on how well the merged proxy actually tracks the performance of a model trained from scratch on the same mixture. No correlation plots, no error bars on the proxy, and no mention of how merging coefficients were chosen or whether post-selection was done. If the separate models sit in different loss landscapes, their average need not predict joint training on mixed data, which undercuts the whole efficiency claim. The stress-test concern about initialization and optimization paths is worth checking in the full text; if direct head-to-head comparisons are missing, the evidence stays thin.

This is for groups already running large pre-training runs and looking for better data ratios. A reader who wants practical tools and the released corpus can extract value even if the proxy is only approximate. It is coherent enough on its own terms to deserve a serious referee, mainly to press on the validation of the merging step and to see the full experimental details.

Referee Report

3 major / 2 minor

Summary. The paper proposes DeMix, a framework for scaling data mixture search in LLM pre-training. Component models are trained separately on individual candidate datasets; weighted merging of their parameters then serves as a proxy for the performance of a model trained from scratch on the corresponding mixture. This decouples search cost from training cost, permitting evaluation of far more candidate mixtures than direct proxy training would allow. Experiments claim that the resulting mixtures yield higher downstream benchmark scores than prior methods while incurring lower total search compute; the authors also release the 22T-token DeMix Corpora.

Significance. If the proxy assumption is shown to be reliable, DeMix would materially reduce the compute required to discover effective pre-training mixtures, addressing a practical bottleneck in scaling LLMs. The public release of a large, validated corpus further supports reproducibility and follow-on work on data-centric LLM training.

major comments (3)

[§3 and §4] §3 (Method) and §4 (Experiments): The central claim rests on the untested assumption that a weighted parameter average of independently trained single-dataset models accurately predicts the performance of a model trained on the corresponding data mixture. No scatter plots, Pearson/Spearman correlations, or direct side-by-side comparisons between merged-proxy accuracy and actual mixture-trained accuracy are reported, leaving the justification for unlimited mixture evaluation without additional training unsupported.
[§4.3] §4.3 (Results): The manuscript reports higher benchmark scores for DeMix-discovered mixtures but provides neither error bars on the proxy evaluations nor details on how the final mixture coefficients were selected after search. This raises the possibility that reported gains reflect post-hoc selection rather than a robust improvement over baselines.
[§3.2] §3.2 (Merging procedure): Because component models optimize separate loss landscapes on disjoint data distributions and typically begin from independent random initializations, parameter-space interpolation need not reproduce the optimization trajectory or learned features that would arise from joint training on the mixed distribution. The paper offers no ablation or theoretical argument addressing this mismatch.
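
The rank correlation asked for in the first major comment could be computed along these lines. This is a dependency-free sketch under stated assumptions: the two score lists are placeholders, not numbers from the paper, and each position is assumed to hold the merged-proxy score and the trained-from-scratch score for the same mixture.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder scores for five mixtures (illustrative, not from the paper).
proxy = [0.40, 0.45, 0.50, 0.55, 0.60]
trained = [0.42, 0.44, 0.53, 0.51, 0.63]
print(round(spearman(proxy, trained), 3))
```

For mixture search, rank agreement is arguably the more relevant quantity: the proxy only needs to order candidate mixtures correctly, not reproduce absolute scores.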

minor comments (2)

[§3] Notation for merging coefficients is introduced without an explicit equation; adding a numbered equation would improve clarity.
[§4] Table captions and axis labels in the experimental figures should explicitly state whether reported numbers are proxy or actual-mixture results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and describe the changes planned for the revised manuscript.

Referee: [§3 and §4] §3 (Method) and §4 (Experiments): The central claim rests on the untested assumption that a weighted parameter average of independently trained single-dataset models accurately predicts the performance of a model trained on the corresponding data mixture. No scatter plots, Pearson/Spearman correlations, or direct side-by-side comparisons between merged-proxy accuracy and actual mixture-trained accuracy are reported, leaving the justification for unlimited mixture evaluation without additional training unsupported.

Authors: We agree that explicit validation of the proxy assumption strengthens the central claim. In the revised manuscript we will add a dedicated subsection containing scatter plots, Pearson and Spearman correlations, and direct side-by-side comparisons between merged-proxy accuracies and the performance of models trained from scratch on the same mixtures. These analyses will be performed on a representative held-out subset of mixtures at reduced scale. revision: yes
Referee: [§4.3] §4.3 (Results): The manuscript reports higher benchmark scores for DeMix-discovered mixtures but provides neither error bars on the proxy evaluations nor details on how the final mixture coefficients were selected after search. This raises the possibility that reported gains reflect post-hoc selection rather than a robust improvement over baselines.

Authors: We will add error bars (computed over multiple random seeds) to all proxy and final-model benchmark results. We will also expand the description of the coefficient selection procedure, clarifying that a separate validation set was used to finalize the mixture weights after the search phase, thereby reducing the risk of post-hoc selection bias. revision: yes
Referee: [§3.2] §3.2 (Merging procedure): Because component models optimize separate loss landscapes on disjoint data distributions and typically begin from independent random initializations, parameter-space interpolation need not reproduce the optimization trajectory or learned features that would arise from joint training on the mixed distribution. The paper offers no ablation or theoretical argument addressing this mismatch.

Authors: We acknowledge that parameter merging is an approximation and does not exactly replicate joint training. The revised version will include a new ablation study that directly compares merged-proxy performance against models trained from scratch on the corresponding mixtures at small scale. While a complete theoretical characterization of the approximation remains an open research question, the added empirical evidence will demonstrate the practical reliability of the proxy. revision: partial
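
The error bars promised in the second response could be as simple as a mean and standard error across seeds. The sketch below assumes one benchmark score per seed; the four scores are placeholders, and the rebuttal does not say which spread statistic the authors intend to report.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, stderr


# Placeholder benchmark scores from four seeds (illustrative only).
seed_scores = [0.50, 0.52, 0.48, 0.50]
mean, stderr = mean_and_stderr(seed_scores)
print(f"{mean:.3f} ± {stderr:.3f}")
```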

Circularity Check

0 steps flagged

No circularity: empirical proxy validated on external benchmarks

The paper presents DeMix as an empirical framework that trains separate component models on candidate datasets and then applies weighted parameter merging to generate performance proxies for data mixtures. No equations or derivations are provided that reduce the claimed optimal mixture or benchmark gains to a fitted quantity defined by the same inputs. The central proxy assumption (merged model performance approximates mixture-trained performance) is treated as a testable hypothesis rather than a definitional identity, and the paper reports validation against external benchmarks and released corpora. No self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear as load-bearing steps in the given text. The method is therefore self-contained against independent evaluation and receives a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that parameter-space merging serves as a faithful proxy for data-mixture training; no new physical entities are postulated and the only free parameters are the merging coefficients searched over the component models.

free parameters (1)

merging coefficients
Weights applied to each component model when forming the merged proxy; these are searched to simulate different data ratios.

axioms (1)

domain assumption: Weighted averaging of parameters from models trained on disjoint data sources produces a model whose downstream performance approximates that of a model trained on the weighted data mixture.
This premise is invoked to justify evaluating unlimited mixtures without retraining; it is stated in the description of the DeMix framework.

pith-pipeline@v0.9.0 · 5775 in / 1372 out tokens · 40700 ms · 2026-05-21T13:37:30.582635+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear


the arithmetic sum of weight deltas from models trained on separate datasets closely approximates the weight delta obtained by training on their union (Qin et al., 2022; Wu et al., 2025; Lin et al., 2025b): $\Delta(D_i \cup D_j) \approx \Delta(D_i) + \Delta(D_j)$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear


merging the component models $\{\Theta_i\}$ at a specific ratio $\{\alpha_i\}$ can obtain a proxy model for any real $\Theta_{\mathrm{mix}}$ trained on $\{\alpha_i D_i\}$
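
The two quoted passages can be linked by a short derivation. It rests on assumptions that the quoted text does not state: that every component model starts from a shared initialization $\Theta_0$ (a symbol introduced here), that $\Delta(D_i)$ denotes the weight change $\Theta_i - \Theta_0$, that the additivity approximation carries over to a ratio-weighted sum, and that the ratios satisfy $\sum_i \alpha_i = 1$. Under those assumptions the weighted sum of deltas reduces to a plain weighted average of the component weights:

```latex
\begin{align*}
\Delta(D_i) &= \Theta_i - \Theta_0, \\
\Theta_{\mathrm{mix}} &\approx \Theta_0 + \sum_i \alpha_i \, \Delta(D_i)
  && \text{(assumed weighted form of the additivity approximation)} \\
&= \Theta_0 + \sum_i \alpha_i \Theta_i - \Big(\sum_i \alpha_i\Big) \Theta_0 \\
&= \sum_i \alpha_i \Theta_i
  && \text{(since } \textstyle\sum_i \alpha_i = 1\text{)}.
\end{align*}
```

Only the last two lines are exact algebra; the first approximation is the empirical claim that the referee report asks the authors to validate.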

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion
cs.LG · 2026-04 · unverdicted · novelty 4.0

A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.