OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Haiyue Song; Masao Utiyama

arXiv: 2603.28858 · v2 · submitted 2026-03-30 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-14 21:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords continual pre-training · data mixture ratios · distribution vectors · model merging · Bayesian optimization · large language models · parameter shift

The pith

Optimal post-hoc merging of distribution vectors from separate models outperforms traditional data mixing for continual pre-training of LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that data mixture ratios, usually fixed before training begins, can instead be chosen after the fact by training one model per dataset, extracting each model's distribution vector that records its parameter shift, and then using Bayesian optimization to find the best linear combination of those vectors. This decouples the expensive ratio search from the training process and yields models that adapt better to target languages and domains than either direct data mixing or simple averaging of the same models. A sympathetic reader cares because the approach cuts the computational cost of ratio tuning by 15-35 times while allowing the same trained models to be recombined on demand for different objectives without any new training runs.

Core claim

Training one CPT model per dataset, extracting its distribution vector that represents the parameter shift induced by that dataset, and searching for optimal composition weights post-hoc via Bayesian optimization produces continual pre-training performance that consistently exceeds both data-mixture baselines and model-averaging baselines; the resulting weights can be read as effective mixture ratios whose use in retraining further improves the mixed-data approach, and the same vector pool can be re-optimized for new objectives to produce tailored models without retraining.
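The post-hoc search in this claim can be sketched as a black-box loop over composition weights. The sketch below is an illustration under stated assumptions, not the authors' implementation: plain random search stands in for Bayesian optimization so that the example needs only the standard library, and `search_weights`, `toy_evaluate`, the weight range, and the trial count are all hypothetical.

```python
import random


def search_weights(evaluate, n_vectors, n_trials=200, low=0.0, high=1.5, seed=0):
    """Search composition weights that maximize evaluate(weights).

    Random search is used here as a simple stand-in for Bayesian
    optimization; the range [low, high] is an assumed search space.
    """
    rng = random.Random(seed)
    best_weights, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Propose one weight per distribution vector.
        weights = [rng.uniform(low, high) for _ in range(n_vectors)]
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


def toy_evaluate(weights):
    """Hypothetical objective standing in for 'merge, then score on a dev set'.

    The best possible score is 0.0, reached at the made-up optimum below.
    """
    optimum = [0.8, 0.3]
    return -sum((w - t) ** 2 for w, t in zip(weights, optimum))


best_weights, best_score = search_weights(toy_evaluate, n_vectors=2)
```

Re-optimizing for a new objective then amounts to calling `search_weights` again with a different `evaluate` function; no model is retrained in this loop.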

What carries the argument

The distribution vector, the parameter shift induced by training on one specific dataset, which is then linearly combined with other vectors using post-hoc optimized weights.
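A minimal sketch of this mechanism, assuming each model is a mapping from parameter names to flat lists of floats. The function names and the toy values are hypothetical; a real 27B checkpoint would use tensors rather than Python lists.

```python
def distribution_vector(theta_cpt, theta_pt):
    """Parameter shift induced by one dataset: tau = theta_cpt - theta_pt."""
    return {
        name: [c - p for c, p in zip(theta_cpt[name], theta_pt[name])]
        for name in theta_pt
    }


def merge(theta_pt, vectors, weights):
    """Linear combination: theta_merge = theta_pt + sum_i alpha_i * tau_i."""
    merged = {name: list(values) for name, values in theta_pt.items()}
    for alpha, tau in zip(weights, vectors):
        for name, deltas in tau.items():
            merged[name] = [m + alpha * d for m, d in zip(merged[name], deltas)]
    return merged


# Toy checkpoints (made-up numbers): one base model and two CPT models.
theta_pt = {"layer.weight": [1.0, 2.0]}
theta_ja = {"layer.weight": [1.5, 2.0]}
theta_math = {"layer.weight": [1.0, 3.0]}

tau_ja = distribution_vector(theta_ja, theta_pt)
tau_math = distribution_vector(theta_math, theta_pt)
merged = merge(theta_pt, [tau_ja, tau_math], [1.0, 0.5])
```

Because the vectors are stored separately from the base model, the same pool can be merged again with different weights at no training cost.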

If this is right

Optimized weights can be interpreted directly as data mixture ratios, and retraining a model with those ratios improves over the original data-mixture baseline.
The same pool of distribution vectors can be re-optimized for any new objective to produce target-tailored models without additional training.
Ratio search cost drops by a factor of 15-35 compared with conventional pre-training tuning loops.
The method outperforms both data-mixture training and direct model averaging on the same set of models.
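The first consequence above reads optimized weights as mixture ratios. The text here does not say how that mapping is done, so the sketch below makes two explicit assumptions: negative weights are clipped to zero, and the remainder is normalized to sum to one. The function name and the example weights are hypothetical.

```python
def weights_to_ratios(weights):
    """Map merge weights to sampling ratios (assumed: clip at 0, then normalize)."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in clipped]


# Made-up weights for three datasets.
ratios = weights_to_ratios([1.0, 0.5, 0.5])
```

Normalization discards the overall scale of the weights, so ratios obtained this way carry only the relative emphasis between datasets.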

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other multi-source adaptation settings where data sources are combined after individual training runs.
Parameter shifts induced by different datasets appear approximately additive under linear weighting for the tested model sizes and domains.
A library of domain-specific models could be maintained and recombined on demand for many downstream goals without repeated full training.
Non-linear merging operators or cheaper optimization methods might further reduce the remaining search cost while preserving the post-hoc flexibility.
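One way to probe the additivity reading above is to measure how much the distribution vectors overlap: pairwise cosine similarity near zero is consistent with low interference, although it does not prove it. The sketch below assumes flattened vectors stored as lists; the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_cosines(named_vectors):
    """Cosine similarity for every unordered pair, keyed by sorted names."""
    names = sorted(named_vectors)
    return {
        (a, b): cosine_similarity(named_vectors[a], named_vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Made-up vectors: "ja" is orthogonal to the others, "math" and "code" overlap.
toy = {"ja": [1.0, 0.0, 0.0], "math": [0.0, 2.0, 0.0], "code": [0.0, 1.0, 1.0]}
overlaps = pairwise_cosines(toy)
```

Pairs with high overlap are the ones where a purely linear merge is most likely to double-count or cancel a shared direction.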

Load-bearing premise

Distribution vectors extracted from separately trained models can be linearly combined via optimized weights to match or exceed the performance of a single model trained on the mixed data from the start.

What would settle it

A controlled experiment that retrains a model from scratch on the data mixture given by the optimized weights and finds it performs strictly better than the vector-merged model would falsify the central claim.

read the original abstract

Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: they must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OptiMer extracts parameter-shift vectors from separate per-dataset runs and optimizes their linear combination afterward, cutting mixture search cost 15-35x while letting the same vectors serve new objectives without retraining.

OptiMer trains one continual pre-training model per dataset, extracts the distribution vector for the parameter shift each run produces, and then uses Bayesian optimization to find the best weights for combining those vectors. The resulting weights define the merged model and can also be reused as mixture ratios for a fresh training run. This moves the expensive ratio search out of the main training loop and into a cheap post-hoc step on already-computed vectors.

On Gemma 3 27B for Japanese, Chinese, math, and code, the method beats standard data-mixing and averaging baselines at 15-35 times lower search cost. The same vector pool can be re-optimized for different targets without new training, which is the practical payoff for anyone who adapts models repeatedly to new languages or domains. Retraining with the discovered ratios also improves over conventional mixing, showing the ratios themselves are useful even if you do not use the merged model directly.

The main soft spot is the additivity assumption. Separate runs do not reproduce the gradient interference and optimizer-state effects that occur when batches from multiple datasets are interleaved from the start. The paper shows that the optimized ratios work when used for retraining, but a direct comparison of the post-hoc merged model against a mixed-data model trained with exactly those ratios, under identical token counts and settings, would clarify whether the merging step adds value beyond ratio discovery. Non-linear interactions could still matter in some cases.

This is for practitioners who tune continual pre-training mixtures often and want to reduce the number of full runs they launch. The scale of the experiments and the size of the reported gains are enough to justify sending it to a serious referee, though the review should check the exact controls, the statistical reporting, and whether the merging step itself outperforms simply using the discovered ratios in a standard mixed run.

Referee Report

2 major / 1 minor

Summary. The paper proposes OptiMer for continual pre-training of LLMs: train one model per dataset to extract a distribution vector (parameter shift induced by that dataset), then use Bayesian optimization to find post-hoc composition weights for linear merging of these vectors. On Gemma 3 27B, this is claimed to outperform data-mixture and model-averaging baselines for Japanese/Chinese language and Math/Code domain adaptation, at 15-35x lower search cost; the optimized weights are also shown to be reusable as mixture ratios for retraining and the vector pool can be re-optimized without further training.

Significance. If the central claim holds, the work would provide a practical way to decouple expensive ratio tuning from the training run itself, allowing reuse of already-trained models and on-demand re-optimization for new objectives. This could meaningfully reduce compute waste in continual pre-training pipelines.

major comments (2)

[Experiments] The central claim that linear merging of independently extracted distribution vectors is equivalent or superior to joint training on the corresponding data mixture rests on an untested additivity assumption. No head-to-head experiment compares the merged model against a model trained from scratch on the mixed data using exactly the same optimized ratios, total tokens, and optimizer state (Experiments section).
[Abstract] The abstract states consistent outperformance on Gemma 3 27B but provides no information on statistical tests, variance across runs, or precise baseline implementations (e.g., how the data-mixture baselines were tuned). This information is load-bearing for the claim that OptiMer is better than data mixing.

minor comments (1)

[Introduction] Notation for 'distribution vector' is introduced without a formal definition or equation; a precise mathematical statement (e.g., as the difference in parameters after training on one dataset) would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed review. We address the major comments point-by-point below, clarifying our experimental design and committing to revisions that directly strengthen the evidence for our claims.

Referee: [Experiments] The central claim that linear merging of independently extracted distribution vectors is equivalent or superior to joint training on the corresponding data mixture rests on an untested additivity assumption. No head-to-head experiment compares the merged model against a model trained from scratch on the mixed data using exactly the same optimized ratios, total tokens, and optimizer state (Experiments section).

Authors: We agree this is a valuable clarification. Our current experiments show OptiMer outperforming data-mixture baselines (tuned via standard grid search within the same compute budget) and demonstrate that the OptiMer-derived weights, when used as mixture ratios for retraining, improve data-mixture CPT performance. However, we did not include a direct head-to-head comparison of the merged model versus a from-scratch model trained on the exact OptiMer ratios with matched total tokens and optimizer state. We will add this experiment in the revision to explicitly test the additivity assumption under identical conditions.
revision: yes

Referee: [Abstract] The abstract states consistent outperformance on Gemma 3 27B but provides no information on statistical tests, variance across runs, or precise baseline implementations (e.g., how the data-mixture baselines were tuned). This information is load-bearing for the claim that OptiMer is better than data mixing.

Authors: We appreciate this observation. The abstract was intentionally concise, but we acknowledge that details on variance, statistical tests, and baseline tuning procedures are important for supporting the performance claims. In the revision we will expand the abstract to reference these elements and ensure the Experiments section provides full details on baseline tuning (e.g., grid search over the same token budget) along with standard deviations across runs and any statistical significance tests performed.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; OptiMer derivation is self-contained

The paper trains one CPT model per dataset to extract independent distribution vectors (parameter shifts), then applies external Bayesian optimization post-hoc to search composition weights. This chain does not reduce any claimed result to its own inputs by construction: the vectors are obtained from separate runs, the optimizer is a standard external procedure, and reported gains are validated against data-mixture and averaging baselines rather than being forced by the fitting process itself. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text, and the method's flexibility claim rests on empirical re-optimization experiments rather than definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that parameter shifts from separate datasets can be treated as independent vectors that combine linearly to approximate mixed-data training effects.

free parameters (1)

Bayesian optimization hyperparameters
Hyperparameters controlling the search over composition weights are chosen for the post-hoc optimization step.

axioms (1)

domain assumption: Distribution vectors extracted from individually trained models can be linearly combined to produce an effective merged model.
The method treats the vectors as additive representations of dataset-induced parameter shifts.

invented entities (1)

distribution vector (no independent evidence)
purpose: Encodes the parameter shift induced by training on one specific dataset.
New representation introduced to enable post-hoc merging.

pith-pipeline@v0.9.0 · 5531 in / 1390 out tokens · 57992 ms · 2026-05-14T21:38:13.186202+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

We define the distribution vector for D_i as τ_i = θ_CPT_i − θ_pt … θ_merge = θ_pt + α_it · τ_it + Σ α_i · τ_i … vectors from distinct datasets are approximately orthogonal, allowing linear combination with minimal interference.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.

Training dynamics show that CPT trajectories are approximately linear in parameter space, linking merge weights to effective training duration.
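One hedged way to read the link between merge weights and training duration is the worked step below. It is an illustration rather than the paper's derivation: the trajectory θ(t), the step count T, and the restriction to a single dataset are notation and assumptions introduced here.

```latex
% Assumption: the CPT trajectory on one dataset is linear in the step count t.
\theta(t) \approx \theta_{\mathrm{pt}} + \frac{t}{T}\,\tau,
\qquad \tau = \theta(T) - \theta_{\mathrm{pt}},
\qquad 0 \le t \le T
% Consequence: scaling the vector by alpha matches an earlier checkpoint.
\quad\Longrightarrow\quad
\theta_{\mathrm{pt}} + \alpha\,\tau \approx \theta(\alpha T)
\quad \text{for } 0 \le \alpha \le 1 .
```

Under this assumption, a weight of α = 0.5 on one vector behaves like stopping that dataset's run at half its steps, which is why a weight can be read as an effective training duration.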

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.