OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
Pith reviewed 2026-05-14 21:38 UTC · model grok-4.3
The pith
Optimal post-hoc merging of distribution vectors from separate models outperforms traditional data mixing for continual pre-training of LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training one CPT model per dataset, extracting its distribution vector (the parameter shift induced by that dataset), and searching for optimal composition weights post-hoc via Bayesian optimization yields continual pre-training performance that consistently exceeds both data-mixture and model-averaging baselines. The resulting weights can be read as effective mixture ratios, and retraining with them further improves the mixed-data approach. The same vector pool can also be re-optimized for new objectives to produce tailored models without retraining.
What carries the argument
The distribution vector, the parameter shift induced by training on one specific dataset, which is then linearly combined with other vectors using post-hoc optimized weights.
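A minimal sketch of this mechanism, assuming model parameters can be represented as a plain mapping from parameter name to a flat list of floats (a toy stand-in for real tensors). The function names `distribution_vector` and `merge`, the checkpoint names, and the numbers are illustrative assumptions, not taken from the paper's code.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Params = Dict[str, List[float]]


def distribution_vector(theta_cpt: Params, theta_pt: Params) -> Params:
    """Parameter shift induced by one dataset: tau = theta_cpt - theta_pt."""
    return {
        name: [c - p for c, p in zip(theta_cpt[name], theta_pt[name])]
        for name in theta_pt
    }


def merge(theta_pt: Params, vectors: List[Params], weights: List[float]) -> Params:
    """Linear merge: theta_pt + sum_i weights[i] * vectors[i]."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per distribution vector")
    merged = {name: list(values) for name, values in theta_pt.items()}
    for tau, alpha in zip(vectors, weights):
        for name, delta in tau.items():
            merged[name] = [m + alpha * d for m, d in zip(merged[name], delta)]
    return merged


# Hypothetical checkpoints with a single two-element "layer".
theta_pt = {"w": [1.0, 2.0]}    # base (pre-trained) model
theta_a = {"w": [1.5, 2.0]}     # after CPT on dataset A
theta_b = {"w": [1.0, 3.0]}     # after CPT on dataset B

tau_a = distribution_vector(theta_a, theta_pt)
tau_b = distribution_vector(theta_b, theta_pt)
merged = merge(theta_pt, [tau_a, tau_b], [0.5, 0.25])
```

Because the vectors are stored separately from the base model, changing the weights only requires calling `merge` again; no training run is repeated.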
If this is right
- Optimized weights can be interpreted directly as data mixture ratios, and retraining a model with those ratios improves over the original data-mixture baseline.
- The same pool of distribution vectors can be re-optimized for any new objective to produce target-tailored models without additional training.
- Ratio search cost drops by a factor of 15-35 compared with conventional pre-training tuning loops.
- The method outperforms both data-mixture training and direct model averaging on the same set of models.
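One way to read merge weights as mixture ratios is to rescale them so they sum to 1. The sketch below assumes non-negative weights and simple normalization; the page does not state the paper's exact conversion, so treat `weights_to_mixture_ratios` and the example weights as hypothetical.

```python
from typing import Dict


def weights_to_mixture_ratios(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative merge weights so that they sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("this sketch only handles non-negative weights")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


# Hypothetical optimized weights, not values reported in the paper.
ratios = weights_to_mixture_ratios({"japanese": 1.0, "math": 0.5, "code": 0.5})
```

The resulting ratios could then be passed to a data loader as sampling proportions for a retraining run.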
Where Pith is reading between the lines
- The approach could extend to other multi-source adaptation settings where data sources are combined after individual training runs.
- Parameter shifts induced by different datasets appear approximately additive under linear weighting for the tested model sizes and domains.
- A library of domain-specific models could be maintained and recombined on demand for many downstream goals without repeated full training.
- Non-linear merging operators or cheaper optimization methods might further reduce the remaining search cost while preserving the post-hoc flexibility.
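The additivity reading above can be probed with a simple diagnostic: the cosine similarity between pairs of flattened distribution vectors. Values near zero would be consistent with the near-orthogonality that the paper passage quoted further down this page describes. This is a sketch on toy vectors; the names and numbers are hypothetical.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two flattened parameter-shift vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical flattened distribution vectors.
tau_a = [0.5, 0.0, 0.0]
tau_b = [0.0, 1.0, 0.0]   # touches different coordinates than tau_a
tau_c = [0.5, 0.5, 0.0]   # overlaps with both

sim_ab = cosine_similarity(tau_a, tau_b)
sim_ac = cosine_similarity(tau_a, tau_c)
```

A low similarity does not by itself prove that merging is lossless; it only indicates that the two shifts mostly move different directions in parameter space.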
Load-bearing premise
Distribution vectors extracted from separately trained models can be linearly combined via optimized weights to match or exceed the performance of a single model trained on the mixed data from the start.
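For reference, the premise can be written in the notation of the paper passage quoted further down this page. The symbols are reproduced as quoted there; the subscript "it" is not defined on this page and is kept unchanged.

```latex
\tau_i = \theta_{\mathrm{CPT}_i} - \theta_{\mathrm{pt}},
\qquad
\theta_{\mathrm{merge}} = \theta_{\mathrm{pt}}
  + \alpha_{\mathrm{it}}\,\tau_{\mathrm{it}}
  + \sum_{i} \alpha_i\,\tau_i
```

The premise is that a model with parameters $\theta_{\mathrm{merge}}$, for suitably chosen weights $\alpha_i$, performs at least as well as a single model trained on the mixed data.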
What would settle it
A controlled experiment that retrains a model from scratch on the data mixture given by the optimized weights and finds that it performs strictly better than the vector-merged model would falsify the central claim.
Original abstract
Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
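The post-hoc search described in the abstract can be sketched as a loop that proposes weight vectors and scores each one. The abstract specifies Bayesian optimization; the sketch below swaps in plain random search to stay dependency-free, and it uses a hypothetical `toy_score` in place of evaluating a merged model on a benchmark. The search bounds, trial count, and target weights are all illustrative assumptions.

```python
import random
from typing import Callable, List, Tuple


def random_search(
    score: Callable[[List[float]], float],
    num_vectors: int,
    num_trials: int = 200,
    low: float = 0.0,
    high: float = 1.0,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Return the best weight vector found and its score (higher is better)."""
    rng = random.Random(seed)
    best_weights: List[float] = []
    best_score = float("-inf")
    for _ in range(num_trials):
        candidate = [rng.uniform(low, high) for _ in range(num_vectors)]
        value = score(candidate)
        if value > best_score:
            best_weights, best_score = candidate, value
    return best_weights, best_score


def toy_score(weights: List[float]) -> float:
    """Hypothetical benchmark score, peaked at weights (0.7, 0.2)."""
    target = [0.7, 0.2]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


best_weights, best_score = random_search(toy_score, num_vectors=2)
```

In a real pipeline each call to `score` would merge the vectors with the candidate weights and evaluate the merged model, so every trial costs one evaluation rather than one training run. A Bayesian optimizer would replace the uniform sampling with proposals informed by earlier trials.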
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes OptiMer for continual pre-training of LLMs: train one model per dataset to extract a distribution vector (parameter shift induced by that dataset), then use Bayesian optimization to find post-hoc composition weights for linear merging of these vectors. On Gemma 3 27B, this is claimed to outperform data-mixture and model-averaging baselines for Japanese/Chinese language and Math/Code domain adaptation, at 15-35x lower search cost; the optimized weights are also shown to be reusable as mixture ratios for retraining and the vector pool can be re-optimized without further training.
Significance. If the central claim holds, the work would provide a practical way to decouple expensive ratio tuning from the training run itself, allowing reuse of already-trained models and on-demand re-optimization for new objectives. This could meaningfully reduce compute waste in continual pre-training pipelines.
major comments (2)
- [Experiments] The central claim that linear merging of independently extracted distribution vectors is equivalent or superior to joint training on the corresponding data mixture rests on an untested additivity assumption. No head-to-head experiment compares the merged model against a model trained from scratch on the mixed data using exactly the same optimized ratios, total tokens, and optimizer state (Experiments section).
- [Abstract] The abstract states consistent outperformance on Gemma 3 27B but provides no information on statistical tests, variance across runs, or precise baseline implementations (e.g., how the data-mixture baselines were tuned). This information is load-bearing for the claim that OptiMer is better than data mixing.
minor comments (1)
- [Introduction] Notation for 'distribution vector' is introduced without a formal definition or equation; a precise mathematical statement (e.g., as the difference in parameters after training on one dataset) would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed review. We address the major comments point-by-point below, clarifying our experimental design and committing to revisions that directly strengthen the evidence for our claims.
- Referee: [Experiments] The central claim that linear merging of independently extracted distribution vectors is equivalent or superior to joint training on the corresponding data mixture rests on an untested additivity assumption. No head-to-head experiment compares the merged model against a model trained from scratch on the mixed data using exactly the same optimized ratios, total tokens, and optimizer state (Experiments section).
Authors: We agree this is a valuable clarification. Our current experiments show OptiMer outperforming data-mixture baselines (tuned via standard grid search within the same compute budget) and demonstrate that the OptiMer-derived weights, when used as mixture ratios for retraining, improve data-mixture CPT performance. However, we did not include a direct head-to-head comparison of the merged model versus a from-scratch model trained on the exact OptiMer ratios with matched total tokens and optimizer state. We will add this experiment in the revision to explicitly test the additivity assumption under identical conditions. revision: yes
- Referee: [Abstract] The abstract states consistent outperformance on Gemma 3 27B but provides no information on statistical tests, variance across runs, or precise baseline implementations (e.g., how the data-mixture baselines were tuned). This information is load-bearing for the claim that OptiMer is better than data mixing.
Authors: We appreciate this observation. The abstract was intentionally concise, but we acknowledge that details on variance, statistical tests, and baseline tuning procedures are important for supporting the performance claims. In the revision we will expand the abstract to reference these elements and ensure the Experiments section provides full details on baseline tuning (e.g., grid search over the same token budget) along with standard deviations across runs and any statistical significance tests performed. revision: yes
Circularity Check
No significant circularity; OptiMer derivation is self-contained
The paper trains one CPT model per dataset to extract independent distribution vectors (parameter shifts), then applies external Bayesian optimization post-hoc to search composition weights. This chain does not reduce any claimed result to its own inputs by construction: the vectors are obtained from separate runs, the optimizer is a standard external procedure, and reported gains are validated against data-mixture and averaging baselines rather than being forced by the fitting process itself. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text, and the method's flexibility claim rests on empirical re-optimization experiments rather than definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (1)
- Bayesian optimization hyperparameters
axioms (1)
- domain assumption: Distribution vectors extracted from individually trained models can be linearly combined to produce an effective merged model.
invented entities (1)
- distribution vector: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear.
We define the distribution vector for D_i as τ_i = θ_CPT_i − θ_pt … θ_merge = θ_pt + α_it · τ_it + Σ α_i · τ_i … vectors from distinct datasets are approximately orthogonal, allowing linear combination with minimal interference.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear.
Training dynamics show that CPT trajectories are approximately linear in parameter space, linking merge weights to effective training duration.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.