arXiv: 2603.28858 · v2 · submitted 2026-03-30 · 💻 cs.CL · cs.AI · cs.LG

OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

Pith reviewed 2026-05-14 21:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords continual pre-training · data mixture ratios · distribution vectors · model merging · Bayesian optimization · large language models · parameter shift

The pith

Optimal post-hoc merging of distribution vectors from separate models outperforms traditional data mixing for continual pre-training of LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that data mixture ratios, usually fixed before training begins, can instead be chosen after the fact: train one model per dataset, extract each model's distribution vector (the parameter shift induced by that dataset), and use Bayesian optimization to find the best linear combination of those vectors. This decouples the expensive ratio search from the training process and yields models that adapt better to target languages and domains than either direct data mixing or simple averaging of the same models. A sympathetic reader cares because the approach cuts the computational cost of ratio tuning by a factor of 15-35 while allowing the same trained models to be recombined on demand for different objectives without any new training runs.

Core claim

Training one continual pre-training (CPT) model per dataset, extracting its distribution vector (the parameter shift induced by that dataset), and searching for optimal composition weights post-hoc via Bayesian optimization yields continual pre-training performance that consistently exceeds both data-mixture and model-averaging baselines. The resulting weights can be read as effective mixture ratios, and retraining with those ratios further improves the data-mixture approach. The same vector pool can also be re-optimized for new objectives to produce tailored models without retraining.
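The extract-then-merge step described above can be sketched in a few lines. This is a minimal illustration, assuming a model is a plain dictionary of parameter lists; the names `distribution_vector`, `merge`, `cpt_japanese`, and `cpt_math` are hypothetical and not taken from the paper, and the numbers are toy values.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]  # toy stand-in for a model state dict


def distribution_vector(base: Params, tuned: Params) -> Params:
    """Parameter shift of one CPT model relative to the shared base model."""
    return {
        name: [t - b for t, b in zip(tuned[name], base[name])]
        for name in base
    }


def merge(base: Params, vectors: Sequence[Params], weights: Sequence[float]) -> Params:
    """Add a weighted sum of distribution vectors back onto the base parameters."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per distribution vector")
    merged = {name: list(values) for name, values in base.items()}
    for vec, w in zip(vectors, weights):
        for name, delta in vec.items():
            merged[name] = [m + w * d for m, d in zip(merged[name], delta)]
    return merged


# Toy values only: one shared base model and two hypothetical CPT checkpoints.
base = {"layer.weight": [1.0, 2.0]}
cpt_japanese = {"layer.weight": [1.5, 2.0]}
cpt_math = {"layer.weight": [1.0, 3.0]}

vectors = [distribution_vector(base, m) for m in (cpt_japanese, cpt_math)]
merged = merge(base, vectors, weights=[0.5, 0.25])
print(merged)
```

Because the vectors are computed once and stored, trying a different weight setting costs one weighted sum rather than a training run, which is the property the post-hoc search relies on.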

What carries the argument

The distribution vector: the parameter shift induced by training on one specific dataset, which is then linearly combined with the other vectors using post-hoc optimized weights.
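One plausible way to write this down is given here. The notation is an assumption made for illustration and is not taken from the paper: \theta_0 denotes the base model parameters, \theta_i the parameters after continual pre-training on dataset i, \alpha_i the composition weights, and S a validation score for the chosen objective.

```latex
\tau_i = \theta_i - \theta_0, \qquad
\theta_{\text{merged}}(\alpha) = \theta_0 + \sum_{i=1}^{n} \alpha_i \, \tau_i, \qquad
\alpha^{\star} = \arg\max_{\alpha} \; S\big(\theta_{\text{merged}}(\alpha)\big)
```

Under this reading, only the last expression depends on the target objective, so changing S changes \alpha^{\star} while the vectors \tau_i stay fixed.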

If this is right

  • Optimized weights can be interpreted directly as data mixture ratios, and retraining a model with those ratios improves over the original data-mixture baseline.
  • The same pool of distribution vectors can be re-optimized for any new objective to produce target-tailored models without additional training.
  • Ratio search cost drops by a factor of 15-35 compared with conventional pre-training tuning loops.
  • The method outperforms both data-mixture training and direct model averaging on the same set of models.
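The first bullet says the optimized weights can be read as mixture ratios. The page does not say how that conversion is done, so the sketch below makes a loud assumption: non-negative weights are simply normalized to sum to 1 and then scaled by a token budget. The function names and the example weights are hypothetical.

```python
from typing import Dict


def weights_to_mixture_ratios(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize non-negative merge weights so they sum to 1 (assumed mapping)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("negative weights have no direct reading as a data ratio")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}


def tokens_per_dataset(ratios: Dict[str, float], token_budget: int) -> Dict[str, int]:
    """Split a total training-token budget according to the mixture ratios."""
    return {name: round(r * token_budget) for name, r in ratios.items()}


# Hypothetical optimized weights, not values reported in the paper.
weights = {"japanese": 2.0, "math": 1.0, "code": 1.0}
ratios = weights_to_mixture_ratios(weights)
plan = tokens_per_dataset(ratios, token_budget=1_000_000)
print(ratios, plan)
```

If the paper's weights can be negative or are not meant to sum to a fixed total, a different mapping would be needed; this normalization is only the simplest candidate.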

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other multi-source adaptation settings where data sources are combined after individual training runs.
  • Parameter shifts induced by different datasets appear approximately additive under linear weighting for the tested model sizes and domains.
  • A library of domain-specific models could be maintained and recombined on demand for many downstream goals without repeated full training.
  • Non-linear merging operators or cheaper optimization methods might further reduce the remaining search cost while preserving the post-hoc flexibility.

Load-bearing premise

Distribution vectors extracted from separately trained models can be linearly combined via optimized weights to match or exceed the performance of a single model trained on the mixed data from the start.

What would settle it

A controlled experiment that retrains a model from scratch on the data mixture given by the optimized weights and finds that it performs strictly better than the vector-merged model would falsify the central claim.

Original abstract

Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: it must be fixed before training begins, and a suboptimal choice can waste weeks of compute. In this work, we propose OptiMer, which decouples ratio selection from training: we train one CPT model per dataset, extract each model's distribution vector, which represents the parameter shift induced by that dataset, and search for optimal composition weights post-hoc via Bayesian optimization. Experiments on Gemma 3 27B across languages (Japanese, Chinese) and domains (Math, Code) show that OptiMer consistently outperforms data mixture and model averaging baselines with 15-35 times lower search cost. Key findings reveal that 1) the optimized weights can be interpreted as data mixture ratios, and retraining with these ratios improves data mixture CPT, and 2) the same vector pool can be re-optimized for a given objective without any retraining, producing target-tailored models on demand. Our work establishes that data mixture ratio selection, traditionally a pre-training decision, can be reformulated as a post-hoc optimization over distribution vectors, offering a more flexible paradigm for continual pre-training.
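The post-hoc search and the re-optimization finding can be illustrated with a toy loop. Two things here are assumptions rather than the paper's method: plain random search stands in for Bayesian optimization to keep the example dependency-free, and the score is a made-up distance to a target vector instead of a benchmark evaluation. All names (`search_weights`, `closeness_to`, `pool`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def merge(base: Vector, vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Base parameters plus a weighted sum of distribution vectors."""
    return [
        b + sum(w * v[i] for w, v in zip(weights, vectors))
        for i, b in enumerate(base)
    ]


def search_weights(
    base: Vector,
    vectors: Sequence[Vector],
    score: Callable[[Vector], float],
    n_trials: int = 200,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Random search over weights in [0, 1]; a stand-in for Bayesian optimization."""
    rng = random.Random(seed)
    best_weights = [0.0] * len(vectors)
    best_score = score(merge(base, vectors, best_weights))
    for _ in range(n_trials):
        candidate = [rng.uniform(0.0, 1.0) for _ in vectors]
        value = score(merge(base, vectors, candidate))
        if value > best_score:  # keep the best-scoring weights seen so far
            best_weights, best_score = candidate, value
    return best_weights, best_score


def closeness_to(target: Vector) -> Callable[[Vector], float]:
    """Toy objective: negative squared distance to a target parameter vector."""
    return lambda params: -sum((p - t) ** 2 for p, t in zip(params, target))


base = [0.0, 0.0]
pool = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical distribution vectors

# Same pool, two objectives: no vector is recomputed between the two searches.
w_a, s_a = search_weights(base, pool, closeness_to([0.8, 0.2]))
w_b, s_b = search_weights(base, pool, closeness_to([0.2, 0.8]))
print(w_a, w_b)
```

A Bayesian optimizer would replace the uniform sampling with proposals guided by earlier trials, which matters when each evaluation is a full benchmark run on a 27B model rather than a cheap function call.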

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes OptiMer for continual pre-training of LLMs: train one model per dataset to extract a distribution vector (parameter shift induced by that dataset), then use Bayesian optimization to find post-hoc composition weights for linear merging of these vectors. On Gemma 3 27B, this is claimed to outperform data-mixture and model-averaging baselines for Japanese/Chinese language and Math/Code domain adaptation, at 15-35x lower search cost; the optimized weights are also shown to be reusable as mixture ratios for retraining and the vector pool can be re-optimized without further training.

Significance. If the central claim holds, the work would provide a practical way to decouple expensive ratio tuning from the training run itself, allowing reuse of already-trained models and on-demand re-optimization for new objectives. This could meaningfully reduce compute waste in continual pre-training pipelines.

major comments (2)
  1. [Experiments] The central claim that linear merging of independently extracted distribution vectors is equivalent or superior to joint training on the corresponding data mixture rests on an untested additivity assumption. No head-to-head experiment compares the merged model against a model trained from scratch on the mixed data using exactly the same optimized ratios, total tokens, and optimizer state (Experiments section).
  2. [Abstract] The abstract states consistent outperformance on Gemma 3 27B but provides no information on statistical tests, variance across runs, or precise baseline implementations (e.g., how the data-mixture baselines were tuned). This information is load-bearing for the claim that OptiMer is better than data mixing.
minor comments (1)
  1. [Introduction] Notation for 'distribution vector' is introduced without a formal definition or equation; a precise mathematical statement (e.g., as the difference in parameters after training on one dataset) would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed review. We address the major comments point-by-point below, clarifying our experimental design and committing to revisions that directly strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Experiments] The central claim that linear merging of independently extracted distribution vectors is equivalent or superior to joint training on the corresponding data mixture rests on an untested additivity assumption. No head-to-head experiment compares the merged model against a model trained from scratch on the mixed data using exactly the same optimized ratios, total tokens, and optimizer state (Experiments section).

    Authors: We agree this is a valuable clarification. Our current experiments show OptiMer outperforming data-mixture baselines (tuned via standard grid search within the same compute budget) and demonstrate that the OptiMer-derived weights, when used as mixture ratios for retraining, improve data-mixture CPT performance. However, we did not include a direct head-to-head comparison of the merged model versus a from-scratch model trained on the exact OptiMer ratios with matched total tokens and optimizer state. We will add this experiment in the revision to explicitly test the additivity assumption under identical conditions. revision: yes

  2. Referee: [Abstract] The abstract states consistent outperformance on Gemma 3 27B but provides no information on statistical tests, variance across runs, or precise baseline implementations (e.g., how the data-mixture baselines were tuned). This information is load-bearing for the claim that OptiMer is better than data mixing.

    Authors: We appreciate this observation. The abstract was intentionally concise, but we acknowledge that details on variance, statistical tests, and baseline tuning procedures are important for supporting the performance claims. In the revision we will expand the abstract to reference these elements and ensure the Experiments section provides full details on baseline tuning (e.g., grid search over the same token budget) along with standard deviations across runs and any statistical significance tests performed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; OptiMer derivation is self-contained

Full rationale

The paper trains one CPT model per dataset to extract independent distribution vectors (parameter shifts), then applies external Bayesian optimization post-hoc to search composition weights. This chain does not reduce any claimed result to its own inputs by construction: the vectors are obtained from separate runs, the optimizer is a standard external procedure, and reported gains are validated against data-mixture and averaging baselines rather than being forced by the fitting process itself. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text, and the method's flexibility claim rests on empirical re-optimization experiments rather than definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that parameter shifts from separate datasets can be treated as independent vectors that combine linearly to approximate mixed-data training effects.

free parameters (1)
  • Bayesian optimization hyperparameters
    Hyperparameters controlling the search over composition weights are chosen for the post-hoc optimization step.
axioms (1)
  • [domain assumption] Distribution vectors extracted from individually trained models can be linearly combined to produce an effective merged model.
    The method treats the vectors as additive representations of dataset-induced parameter shifts.
invented entities (1)
  • distribution vector [no independent evidence]
    purpose: Encodes the parameter shift induced by training on one specific dataset.
    New representation introduced to enable post-hoc merging.


