
arxiv: 2605.29489 · v1 · pith:YY63EMNTnew · submitted 2026-05-28 · 💻 cs.LG · cs.SY · eess.SY

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Pith reviewed 2026-06-29 09:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY
keywords model merging · weight-space merging · expert access sets · I/O budgeting · LLM merging · delta blocks · MergePipe

The pith

MergePipe budgets expert weight reads to enable scalable LLM merging with order-of-magnitude I/O savings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that at LLM scale the real constraint on weight-space merging is the I/O cost of reading expert checkpoints, not the algebraic operation itself. It formulates the task as choosing which expert delta blocks to access under an explicit budget, then constructs deterministic access plans that are budget-sound by construction. For fixed-coefficient additive merges, the error these plans introduce is bounded by the norm of the omitted deltas, and the plans recover the full-read merge when the budget permits full access. Experiments on Qwen and Llama workloads show up to an order of magnitude less expert-read I/O, speedups of up to 11×, and parameter deviation on the order of 10^{-3}, with no monotonic degradation on downstream benchmarks. A reader would care because the approach makes repeated merging of large models feasible on ordinary hardware rather than requiring simultaneous loading of every checkpoint.

Core claim

MergePipe indexes parameter blocks, builds deterministic access plans under a stated I/O budget, and executes the resulting budgeted merge via replayable manifests. For fixed-coefficient additive operators the omitted-update error is bounded by the norm of the omitted deltas; the plan recovers the full-read merge exactly when the budget permits full access.
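The bound stated here follows from the triangle inequality. What follows is a sketch under assumed notation, not the paper's proof: $\theta_{0,b}$ is block $b$ of the base checkpoint, $\Delta_{i,b}$ is the delta of expert $i$ on block $b$ (treated as zero outside block $b$ when norms are taken over the whole parameter vector), $\alpha_{i,b}$ is a fixed coefficient, and $A_{i,b} \in \{0,1\}$ is the access mask.

```latex
\begin{align*}
\theta^{\mathrm{full}}_b &= \theta_{0,b} + \sum_{i=1}^{K} \alpha_{i,b}\,\Delta_{i,b}, \\
\theta^{A}_b &= \theta_{0,b} + \sum_{i=1}^{K} \alpha_{i,b}\,A_{i,b}\,\Delta_{i,b}, \\
\bigl\|\theta^{\mathrm{full}} - \theta^{A}\bigr\|
  &= \Bigl\|\sum_{i,b} \alpha_{i,b}\,(1 - A_{i,b})\,\Delta_{i,b}\Bigr\|
  \;\le\; \sum_{(i,b)\,:\,A_{i,b}=0} |\alpha_{i,b}|\,\bigl\|\Delta_{i,b}\bigr\|.
\end{align*}
```

Setting $A_{i,b} = 1$ for every pair empties the sum on the right, which is the full-budget recovery property. The argument relies on the coefficients not depending on which blocks were read; operators that renormalize or re-weight based on the accessed set would need a separate analysis.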

What carries the argument

The expert access-set problem that selects subsets of delta blocks under an explicit I/O budget while preserving merge semantics through deterministic, budget-sound plans.
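One way to picture access-set selection is a greedy planner over a block catalog. The sketch below is an assumption-laden illustration, not MergePipe's planner: the `Block` fields, the `score` importance proxy, and the score-per-byte ordering are all hypothetical, since the reviewed text does not say how blocks are ranked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    expert: str    # hypothetical expert identifier
    name: str      # hypothetical block (tensor shard) name
    nbytes: int    # cost of reading this delta block; must be positive
    score: float   # assumed importance proxy, e.g. a cataloged delta norm


def plan_access(blocks, budget_bytes):
    """Greedy, deterministic selection of blocks whose total size fits the budget."""
    # Sort by score per byte (descending); ties break on (expert, name) so the
    # same catalog and budget always produce the same plan.
    order = sorted(blocks, key=lambda b: (-b.score / b.nbytes, b.expert, b.name))
    chosen, used = [], 0
    for b in order:
        if used + b.nbytes <= budget_bytes:   # never exceed the budget
            chosen.append(b)
            used += b.nbytes
    return chosen, used


catalog = [
    Block("math", "layer0.attn", 40, 8.0),
    Block("math", "layer0.mlp", 60, 3.0),
    Block("code", "layer0.attn", 40, 2.0),
    Block("code", "layer0.mlp", 60, 9.0),
]
plan, used = plan_access(catalog, budget_bytes=100)
full_plan, full_used = plan_access(catalog, budget_bytes=200)
```

With a budget of 100 bytes this toy catalog selects two blocks that exactly fill the budget; with 200 bytes every block is selected, mirroring the full-budget recovery property. Greedy selection is not optimal for knapsack-style problems in general; it is used here only because it is simple and reproducible.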

If this is right

  • Expert-read I/O falls by up to an order of magnitude on representative Qwen and Llama merging workloads.
  • Execution time improves by up to 11 times compared with full-read baselines.
  • Parameter deviation from the full-read result stays at O(10^{-3}) across tested budgets.
  • Downstream benchmark scores exhibit no monotonic degradation even when the budget is reduced.
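The deviation claim in the list above can be checked numerically on a toy case. The following is a minimal sketch, assuming a plain weighted-sum merge over named blocks; the block names, coefficients, and values are invented for illustration and are not taken from the paper's workloads.

```python
import math


def merge(base, deltas, coeffs, mask):
    """Additive merge: base + sum_i coeff_i * mask_i * delta_i, block by block."""
    out = {}
    for name, w in base.items():
        acc = list(w)
        for expert, d in deltas.items():
            if mask[(expert, name)]:   # masked-out blocks are never touched
                acc = [a + coeffs[expert] * x for a, x in zip(acc, d[name])]
        out[name] = acc
    return out


def l2(vecs):
    """Euclidean norm of the concatenation of several vectors."""
    return math.sqrt(sum(x * x for v in vecs for x in v))


base = {"b0": [1.0, 2.0], "b1": [0.5, -0.5]}
deltas = {
    "e1": {"b0": [0.3, -0.4], "b1": [0.0, 0.001]},
    "e2": {"b0": [0.002, 0.0], "b1": [0.6, 0.8]},
}
coeffs = {"e1": 0.5, "e2": 0.5}

full_mask = {(e, b): True for e in deltas for b in base}
budget_mask = dict(full_mask)
budget_mask[("e1", "b1")] = False   # omit the two smallest deltas
budget_mask[("e2", "b0")] = False

full = merge(base, deltas, coeffs, full_mask)
budgeted = merge(base, deltas, coeffs, budget_mask)

# Observed deviation versus the omitted-delta bound.
deviation = l2([[f - g for f, g in zip(full[n], budgeted[n])] for n in base])
bound = sum(abs(coeffs[e]) * l2([deltas[e][b]])
            for (e, b), keep in budget_mask.items() if not keep)
relative = deviation / l2(full.values())
```

Here the observed deviation stays below the sum of the scaled norms of the omitted deltas, as the bound predicts for fixed coefficients. Whether a real workload lands near 10^{-3} depends on how small the omitted deltas actually are, which this toy example does not establish.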

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same access-planning layer could be applied to other weight-space operations once error bounds are derived for non-additive operators.
  • Access budgeting may become a standard middleware layer for any operation that must combine multiple large checkpoints.
  • Hardware with memory too small to hold all checkpoints simultaneously could still perform merges that previously required full simultaneous residency.
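The residency point in the last item can be illustrated with a block-at-a-time executor. This is a sketch under assumptions, not the paper's runtime: `read_base` and `read_delta` are hypothetical loader callbacks standing in for real checkpoint I/O, and the in-memory dictionaries and read log exist only to make the example runnable.

```python
def streaming_merge(block_names, read_base, read_delta, coeffs, selected):
    """Yield (name, merged_block) one block at a time.

    Only (expert, block) pairs listed in `selected` are passed to read_delta,
    so omitted blocks never trigger an expert read.
    """
    for name in block_names:
        acc = list(read_base(name))
        for expert in sorted(coeffs):            # fixed order for determinism
            if (expert, name) in selected:
                delta = read_delta(expert, name)
                acc = [a + coeffs[expert] * d for a, d in zip(acc, delta)]
        yield name, acc                          # caller can write it out and free it


# Toy in-memory "storage"; the read log stands in for real I/O accounting.
BASE = {"b0": [1.0, 1.0], "b1": [2.0, 2.0]}
DELTAS = {("e1", "b0"): [0.5, 0.0], ("e1", "b1"): [0.0, 0.5],
          ("e2", "b0"): [1.0, 1.0], ("e2", "b1"): [0.25, 0.25]}
reads = []


def read_base(name):
    return BASE[name]


def read_delta(expert, name):
    reads.append((expert, name))
    return DELTAS[(expert, name)]


selected = {("e1", "b0"), ("e2", "b1")}
merged = dict(streaming_merge(["b0", "b1"], read_base, read_delta,
                              {"e1": 1.0, "e2": 2.0}, selected))
```

At any moment the generator holds one base block and at most one expert delta, so peak memory is set by block size rather than by the number of experts. The read log records exactly the two selected pairs.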

Load-bearing premise

The checkpoints occupy a shared weight coordinate system and the merge operators are fixed-coefficient additions of deltas.

What would settle it

A direct run on the same Qwen or Llama workloads showing that the budgeted merge deviates from the full-access merge by substantially more than the reported O(10^{-3}) in parameter norm, or that it produces consistent benchmark degradation relative to the full-access merge.

Figures

Figures reproduced from arXiv: 2605.29489 by Congkai Xie, Hongxia Yang, Jianmin Wu, Su Lu, Yanggan Gu, Yifan Yang, Yuanyi Wang, Zhaoyi Yan.

Figure 1. Budgeted access sets in weight space. Full-read merging fixes A = 1. MergePipe chooses a budget-feasible access mask A and executes the induced mask-aware operator Ψ_op; omitted entries are represented by the mask and do not trigger expert reads.
Figure 2. Scaling with the number of experts. Full-read merging repeatedly scans expert checkpoints, so expert-read I/O and wall time grow with K. MergePipe enforces a fixed expert-I/O budget, keeping expert reads nearly flat and shifting the remaining cost toward the unavoidable checkpoint boundary.
Figure 3. Budget-aware planning behavior. (a) Realized expert reads grow monotonically with the requested I/O budget and remain under the cap. (b) End-to-end wall time follows expert-read volume. (c) The fraction of accessed expert blocks expands smoothly as more budget is allocated.
Figure 4. MergePipe system overview. The runtime realizes budget-aware weight-space merging through block-level cataloging, access-set planning, mask-aware execution, and manifest-based replay. The planner controls expert-delta reads under the I/O budget, while the executor streams only selected expert blocks and materializes the resulting logical checkpoint.
Figure 5. Where MergePipe saves time. Top-left: planning, flush, and commit are small relative to execution. Top-right: tightening the budget primarily removes expert reads, while base reads and output writes remain nearly fixed. Bottom: before budgeting, expert reads scale with the number of experts.
Original abstract

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an \emph{expert access-set} problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
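The abstract mentions replayable manifests without describing their format. The sketch below shows one plausible shape, purely as an assumption: the field names, the `weighted_sum` operator label, and the SHA-256 digest are hypothetical choices for illustration, not MergePipe's schema.

```python
import hashlib
import json


def build_manifest(operator, budget_bytes, coeffs, selected_blocks):
    """Serialize a plan into a canonical JSON body plus a content digest."""
    body = {
        "operator": operator,
        "budget_bytes": budget_bytes,
        "coeffs": coeffs,
        # Sorting makes the manifest independent of planner iteration order.
        "selected": sorted([list(pair) for pair in selected_blocks]),
    }
    text = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return {"body": body, "sha256": hashlib.sha256(text.encode()).hexdigest()}


def verify_manifest(manifest):
    """Recompute the digest so a replay can refuse an altered plan."""
    text = json.dumps(manifest["body"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode()).hexdigest() == manifest["sha256"]


# The same plan expressed in two different orders yields the same manifest.
m1 = build_manifest("weighted_sum", 100, {"e1": 0.5, "e2": 0.5},
                    {("e2", "b1"), ("e1", "b0")})
m2 = build_manifest("weighted_sum", 100, {"e2": 0.5, "e1": 0.5},
                    [("e1", "b0"), ("e2", "b1")])
```

Canonical serialization is what makes replay meaningful: two runs that agree on operator, budget, coefficients, and selected blocks produce byte-identical manifests, and any edit to the body changes the digest.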

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that MergePipe, by casting LLM merging as an expert access-set problem and building deterministic access plans under an I/O budget, can reduce expert-read I/O by up to an order of magnitude and achieve up to 11× speedups on Qwen and Llama workloads. It states that the plans are budget-sound by construction, recover the full merge at full budget, and for fixed-coefficient additive operators the omitted-update error is bounded by the norm of omitted deltas, with empirical results showing O(10^{-3}) parameter deviation and no monotonic degradation on downstream benchmarks.

Significance. If the results hold, this work could make weight-space model merging more scalable by addressing the I/O bottleneck at LLM scale. The explicit construction of budget-sound plans and the error bound are notable strengths, as is the empirical demonstration on real models.

major comments (1)
  1. [Abstract] The error bound for the omitted-update is stated but no derivation or proof is provided, which is central to validating the correctness of the budgeted merge for the restricted operator class.
minor comments (2)
  1. [Abstract] The experimental protocol, including how budgets are swept and data-exclusion rules, is not described, making it difficult to reproduce the reported speedups and deviation results.
  2. [Abstract] The specific merge operators and checkpoint families used in the Qwen and Llama experiments are not named, limiting the ability to assess the generality of the claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's potential significance and for the constructive comment. We address the major comment point-by-point below.

  1. Referee: [Abstract] The error bound for the omitted-update is stated but no derivation or proof is provided, which is central to validating the correctness of the budgeted merge for the restricted operator class.

    Authors: We agree that the abstract states the bound without an accompanying derivation. The current manuscript provides only an informal justification in the methods section. In the revised version we will add an explicit proof (or detailed derivation) showing that, for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of the omitted deltas. This proof will be placed in the main text (likely as a new subsection or appendix reference) rather than left implicit.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an explicit algorithmic construction


The paper defines MergePipe as a budget-aware layer that builds deterministic access plans which are budget-sound by construction and recover the full merge at full budget. This is a definitional property of the proposed execution layer rather than a derived claim. The error bound for fixed-coefficient additive operators is stated directly from the norm of omitted deltas under the shared-coordinate assumption. Empirical results (I/O reduction, speedups, parameter deviation) are reported as measurements on Qwen/Llama workloads, not as predictions obtained by fitting to the same data. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The derivation chain is therefore self-contained as an engineering construction with explicitly stated operating assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full paper text unavailable, so ledger is necessarily incomplete. The shared coordinate system and fixed-coefficient additive operator class are the only explicit premises visible.

axioms (1)
  • domain assumption: Checkpoints reside in a shared weight coordinate system.
    Required for the merge operator to be well-defined across experts.
invented entities (1)
  • MergePipe (no independent evidence)
    purpose: Budget-aware execution layer that indexes blocks and builds deterministic access plans.
    New system introduced by the paper; no external evidence supplied.

pith-pipeline@v0.9.1-grok · 5738 in / 1266 out tokens · 35435 ms · 2026-06-29T09:09:10.764544+00:00 · methodology

