Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

Congkai Xie; Hongxia Yang; Jianmin Wu; Su Lu; Yanggan Gu; Yifan Yang; Yuanyi Wang; Zhaoyi Yan

arXiv: 2605.29489 · v1 · pith:YY63EMNTnew · submitted 2026-05-28 · 💻 cs.LG · cs.SY · eess.SY

Yuanyi Wang, Yanggan Gu, Su Lu, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

Pith reviewed 2026-06-29 09:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY

keywords model merging · weight-space merging · expert access sets · I/O budgeting · LLM merging · delta blocks · MergePipe

The pith

MergePipe budgets expert weight reads to enable scalable LLM merging with order-of-magnitude I/O savings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that at LLM scale the real constraint on weight-space merging is the I/O cost of reading expert checkpoints, not the algebraic operation itself. It formulates the task as choosing which expert delta blocks to access under an explicit I/O budget, then constructs deterministic access plans that are budget-sound by construction. For fixed-coefficient additive merges, the error from omitted updates is bounded by the norm of the omitted deltas, and the plan recovers the full-read merge at full budget. Experiments on Qwen and Llama workloads show up to an order of magnitude less expert-read I/O, speedups of up to 11×, parameter deviation of O(10^{-3}) from the full-read merge, and no monotonic degradation on downstream benchmarks. A reader would care because the approach makes repeated merging of large models feasible on ordinary hardware rather than requiring simultaneous loading of every checkpoint.

Core claim

MergePipe indexes parameter blocks, builds deterministic access plans under a stated I/O budget, and executes the resulting budgeted merge via replayable manifests. For fixed-coefficient additive operators the omitted-update error is bounded by the norm of the omitted deltas; the plan recovers the full-read merge exactly when the budget permits full access.
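The bound can be sketched in a few lines. The notation here is an assumption made for illustration and is not taken from the paper: $\theta_0$ is the base checkpoint, $\Delta_i = \theta_i - \theta_0$ is the delta of expert $i$, $\alpha_i$ are the fixed coefficients, and $A_i \in \{0,1\}^d$ is the access mask for expert $i$ (1 where the block is read).

```latex
\begin{aligned}
\theta_{\text{full}} &= \theta_0 + \sum_{i=1}^{K} \alpha_i \,\Delta_i, \\
\theta_{A} &= \theta_0 + \sum_{i=1}^{K} \alpha_i \,(A_i \odot \Delta_i), \\
\theta_{\text{full}} - \theta_{A} &= \sum_{i=1}^{K} \alpha_i \,\bigl((\mathbf{1} - A_i) \odot \Delta_i\bigr), \\
\lVert \theta_{\text{full}} - \theta_{A} \rVert &\le \sum_{i=1}^{K} \lvert \alpha_i \rvert \,\lVert (\mathbf{1} - A_i) \odot \Delta_i \rVert .
\end{aligned}
```

The last line is the triangle inequality. With $A_i = \mathbf{1}$ for every expert the right-hand side is zero, which is consistent with exact recovery at full budget. The paper's own statement of the bound (choice of norm, constants, per-block indexing) may differ from this sketch.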

What carries the argument

The expert access-set problem that selects subsets of delta blocks under an explicit I/O budget while preserving merge semantics through deterministic, budget-sound plans.
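One way to picture a deterministic, budget-sound plan is a greedy selection over a block catalog. The sketch below is a hypothetical illustration, not MergePipe's planner: the `Block` fields, the score (assumed here to be something like |α| times the delta norm), and the score-per-byte ranking are all assumptions. It also assumes every block has a positive size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One expert delta block in a hypothetical catalog."""
    expert: str    # expert checkpoint name
    name: str      # parameter block name
    nbytes: int    # bytes that must be read to access the block (> 0)
    score: float   # assumed importance, e.g. |alpha| * ||delta||


def plan_access(blocks, budget_bytes):
    """Greedy, deterministic selection of blocks under a byte budget.

    Blocks are ranked by score per byte; ties are broken by (expert, name),
    so the same catalog and budget always produce the same plan.
    """
    ranked = sorted(
        blocks,
        key=lambda b: (-b.score / b.nbytes, b.expert, b.name),
    )
    selected, used = [], 0
    for b in ranked:
        # A block is only added if it keeps the plan within the budget.
        if used + b.nbytes <= budget_bytes:
            selected.append(b)
            used += b.nbytes
    return selected, used


catalog = [
    Block("math", "layer0.attn", 400, 8.0),
    Block("math", "layer0.mlp", 600, 3.0),
    Block("code", "layer0.attn", 400, 2.0),
    Block("code", "layer0.mlp", 600, 9.0),
]

plan, used = plan_access(catalog, budget_bytes=1000)
full_plan, full_used = plan_access(catalog, budget_bytes=2000)
```

With a 1000-byte budget the sketch selects two of the four blocks and never exceeds the cap; with a 2000-byte budget it selects everything, mirroring the full-read case. A greedy ranking is only one possible policy and carries no optimality guarantee.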

If this is right

Expert-read I/O falls by up to an order of magnitude on representative Qwen and Llama merging workloads.
Execution time improves by up to 11 times compared with full-read baselines.
Parameter deviation from the full-read result stays at O(10^{-3}) across tested budgets.
Downstream benchmark scores exhibit no monotonic degradation even when the budget is reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same access-planning layer could be applied to other weight-space operations once error bounds are derived for non-additive operators.
Access budgeting may become a standard middleware layer for any operation that must combine multiple large checkpoints.
Hardware with memory too small to hold all checkpoints simultaneously could still perform merges that previously required full simultaneous residency.
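The last point can be made concrete with a block-at-a-time executor. This is a minimal sketch under stated assumptions, not the paper's runtime: `stream_merge`, `read_base`, and `read_delta` are hypothetical names, and the in-memory dictionaries stand in for checkpoint files on disk.

```python
def stream_merge(block_names, read_base, read_delta, plan, alphas):
    """Merge one block at a time; only the current block is held in memory.

    plan   : set of (expert, block) pairs that may be read
    alphas : fixed coefficient per expert
    Yields (block_name, merged_values) so a caller can write each block out
    immediately instead of keeping the whole checkpoint resident.
    """
    for name in block_names:
        merged = list(read_base(name))
        for expert in sorted(alphas):
            if (expert, name) not in plan:
                continue  # omitted entry: no expert read is issued
            delta = read_delta(expert, name)
            merged = [m + alphas[expert] * d for m, d in zip(merged, delta)]
        yield name, merged


# Toy in-memory "checkpoints" standing in for files on disk.
base = {"w0": [1.0, 1.0], "w1": [0.0, 2.0]}
deltas = {
    ("math", "w0"): [0.5, -0.5],
    ("math", "w1"): [0.25, 0.0],
    ("code", "w0"): [0.0, 1.0],
    ("code", "w1"): [1.0, 1.0],
}
reads = []  # log of expert reads actually issued


def read_base(name):
    return base[name]


def read_delta(expert, name):
    reads.append((expert, name))
    return deltas[(expert, name)]


alphas = {"math": 0.5, "code": 0.5}
plan = {("math", "w0"), ("code", "w1")}
merged = dict(stream_merge(["w0", "w1"], read_base, read_delta, plan, alphas))
```

Only the two planned expert blocks are read, and peak residency is one base block plus one delta block. A real executor would also need to handle dtype, sharding, and write-out, none of which is modeled here.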

Load-bearing premise

The checkpoints occupy a shared weight coordinate system and the merge operators are fixed-coefficient additions of deltas.

What would settle it

A direct run on the same Qwen or Llama workload showing that the budgeted merge deviates by substantially more than 0.001 in parameter norm or produces consistent benchmark degradation relative to the full-access merge.
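Such a check could look like the sketch below. It is illustrative only: the helper names are hypothetical, the toy vectors are not data from the paper, and it assumes the reported deviation is a relative L2 norm, which the abstract does not specify.

```python
import math


def l2(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def relative_deviation(full, budgeted):
    """||full - budgeted||_2 / ||full||_2 for two flat parameter vectors."""
    diff = [f - b for f, b in zip(full, budgeted)]
    return l2(diff) / l2(full)


def omitted_bound(alphas, omitted_deltas):
    """Sum over experts of |alpha_i| * ||omitted part of delta_i||_2."""
    return sum(abs(alphas[e]) * l2(d) for e, d in omitted_deltas.items())


# Toy example: one expert whose entire delta was omitted by the plan.
alphas = {"e1": 0.5}
base = [3.0, 4.0, 0.0]
delta_e1 = [0.0, 0.0, 0.006]

full = [b + alphas["e1"] * d for b, d in zip(base, delta_e1)]
budgeted = list(base)  # nothing from e1 was read

abs_dev = l2([f - b for f, b in zip(full, budgeted)])
rel_dev = relative_deviation(full, budgeted)
bound = omitted_bound(alphas, {"e1": delta_e1})

within_threshold = rel_dev <= 1e-3       # assumed 0.001 threshold
within_bound = abs_dev <= bound + 1e-12  # small tolerance for rounding
```

On real checkpoints the same two quantities would be accumulated block by block rather than over one in-memory vector, and the benchmark comparison would still have to be run separately.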

Figures

Figures reproduced from arXiv: 2605.29489 by Congkai Xie, Hongxia Yang, Jianmin Wu, Su Lu, Yanggan Gu, Yifan Yang, Yuanyi Wang, Zhaoyi Yan.

**Figure 1.** Budgeted access sets in weight space. Full-read merging fixes A = 1. MergePipe chooses a budget-feasible access mask A and executes the induced mask-aware operator Ψ_op; omitted entries are represented by the mask and do not trigger expert reads.

**Figure 2.** Scaling with the number of experts. Full-read merging repeatedly scans expert checkpoints, so expert-read I/O and wall time grow with K. MergePipe enforces a fixed expert-I/O budget, keeping expert reads nearly flat and shifting the remaining cost toward the unavoidable checkpoint boundary.

**Figure 3.** Budget-aware planning behavior. (a) Realized expert reads grow monotonically with the requested I/O budget and remain under the cap. (b) End-to-end wall time follows expert-read volume. (c) The fraction of accessed expert blocks expands smoothly as more budget is allocated.

**Figure 4.** MergePipe system overview. The runtime realizes budget-aware weight-space merging through block-level cataloging, access-set planning, mask-aware execution, and manifest-based replay. The planner controls expert-delta reads under the I/O budget, while the executor streams only selected expert blocks and materializes the resulting logical checkpoint.

**Figure 5.** Where MergePipe saves time. Top-left: planning, flush, and commit are small relative to execution. Top-right: tightening the budget primarily removes expert reads, while base reads and output writes remain nearly fixed. Bottom: before budgeting, expert reads scale with the number of experts.

read the original abstract

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an expert access-set problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to 11× speedups. Representative budget sweeps show O(10^{-3}) parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MergePipe frames merging as expert access-set budgeting with deterministic plans and an additive error bound, but the abstract supplies no derivations or protocol details.

The core contribution is treating LLM merging as an explicit expert access-set budgeting problem rather than a pure algebraic one. MergePipe builds deterministic, replayable plans that stay within a given I/O budget, recover the full merge at full budget, and for fixed-coefficient additive operators bound the omitted-update error by the norm of the skipped deltas.

This is useful because at scale the real constraint is often reading the checkpoints, not the arithmetic. The reported numbers on Qwen and Llama workloads are consistent with that diagnosis: an order-of-magnitude drop in expert reads, up to 11× speedups, O(10^{-3}) parameter deviation, and no monotonic benchmark drop.

The limitation is that the abstract states the bound and the construction but gives no derivation steps, no description of how the plans are actually indexed or chosen, and no experimental protocol or data-exclusion rules. That makes it impossible to judge how tight the bound is or whether the results generalize beyond the tested operators and the shared-coordinate assumption.

The work is aimed at practitioners who already run merging pipelines and hit I/O walls. It deserves a serious referee because the framing is new enough and the practical payoff is concrete enough to warrant checking the missing details, even if the current writeup is thin.

Referee Report

1 major / 2 minor

Summary. The paper claims that MergePipe, by casting LLM merging as an expert access-set problem and building deterministic access plans under an I/O budget, can reduce expert-read I/O by up to an order of magnitude and achieve up to 11× speedups on Qwen and Llama workloads. It states that the plans are budget-sound by construction, recover the full merge at full budget, and for fixed-coefficient additive operators the omitted-update error is bounded by the norm of omitted deltas, with empirical results showing O(10^{-3}) parameter deviation and no monotonic degradation on downstream benchmarks.

Significance. If the results hold, this work could make weight-space model merging more scalable by addressing the I/O bottleneck at LLM scale. The explicit construction of budget-sound plans and the error bound are notable strengths, as is the empirical demonstration on real models.

major comments (1)

[Abstract] The error bound for the omitted-update is stated but no derivation or proof is provided, which is central to validating the correctness of the budgeted merge for the restricted operator class.

minor comments (2)

[Abstract] The experimental protocol, including how budgets are swept and data-exclusion rules, is not described, making it difficult to reproduce the reported speedups and deviation results.
[Abstract] The specific merge operators and checkpoint families used in the Qwen and Llama experiments are not named, limiting the ability to assess the generality of the claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's potential significance and for the constructive comment. We address the major comment point-by-point below.

Referee: [Abstract] The error bound for the omitted-update is stated but no derivation or proof is provided, which is central to validating the correctness of the budgeted merge for the restricted operator class.

Authors: We agree that the abstract states the bound without an accompanying derivation. The current manuscript provides only an informal justification in the methods section. In the revised version we will add an explicit proof (or detailed derivation) showing that, for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of the omitted deltas. This proof will be placed in the main text (likely as a new subsection or appendix reference) rather than left implicit.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is an explicit algorithmic construction

The paper defines MergePipe as a budget-aware layer that builds deterministic access plans which are budget-sound by construction and recover the full merge at full budget. This is a definitional property of the proposed execution layer rather than a derived claim. The error bound for fixed-coefficient additive operators is stated directly from the norm of omitted deltas under the shared-coordinate assumption. Empirical results (I/O reduction, speedups, parameter deviation) are reported as measurements on Qwen/Llama workloads, not as predictions obtained by fitting to the same data. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The derivation chain is therefore self-contained as an engineering construction with explicitly stated operating assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; full paper text unavailable, so ledger is necessarily incomplete. The shared coordinate system and fixed-coefficient additive operator class are the only explicit premises visible.

axioms (1)

Domain assumption: Checkpoints reside in a shared weight coordinate system.
Required for the merge operator to be well-defined across experts.

invented entities (1)

MergePipe (no independent evidence)
Purpose: Budget-aware execution layer that indexes blocks and builds deterministic access plans.
New system introduced by the paper; no external evidence supplied.

pith-pipeline@v0.9.1-grok · 5738 in / 1266 out tokens · 35435 ms · 2026-06-29T09:09:10.764544+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 9 canonical work pages · 7 internal anchors

[1] Evaluating Large Language Models Trained on Code
Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[2] DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368–2378, 2019.

[3] InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Gu, Y., Wang, Y., Yan, Z., Zhang, Y., Zhou, Q., Wu, F., and Yang, H. InfiFPO: Implicit model fusion via preference optimization in large language models. arXiv preprint arXiv:2505.13878.

[4] FeatCal: Feature Calibration for Post-Merging Models
Gu, Y., Cai, S., Wang, Z., Wang, W., Wang, Y., Wang, P., Huang, S., Lu, S., Wu, J., and Yang, H. FeatCal: Feature calibration for post-merging models. arXiv preprint arXiv:2605.13030.

[5] Sens-merging: Sensitivity-guided parameter balancing for merging large language models
Liu, S., Wu, H., He, B., Han, X., Yuan, M., and Song, L. Sens-merging: Sensitivity-guided parameter balancing for merging large language models. arXiv preprint arXiv:2502.12420, 2025a.
Liu, Z., Wu, H., Yao, Y., She, R., Han, X., Zhong, T., and Yuan, M. Lore-merging: Exploring low-rank estimation for large language model merging. arXiv preprint arXiv:25...

[6] Activation-informed merging of large language models
Nobari, A. H., Alim, K., ArjomandBigdeli, A., Srivastava, A., Ahmed, F., and Azizan, N. Activation-informed merging of large language models. arXiv preprint arXiv:2502.02421.

[7] Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
Wang, P., Liu, P., Wang, Y., Chen, G., Ren, X., Li, X., Hao, Z., Kong, Y., Zhang, Q., and Ni, D. Discovering physical directions in weight space: Composing neural PDE experts. arXiv preprint arXiv:2605.14546, 2026a.
Wang, W., Gu, Y., Cai, S., Wang, Y., Wang, P., Wu, J., and Yang, H. E-pmq: Expert-guided post-merge quantization with merged-weight anch...

[8] Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
Wang, Y., Lu, S., Gu, Y., Wang, P., Yang, Y., Yan, Z., Xie, C., Wu, J., and Yang, H. Not all disagreement is learnable: Token teachability in on-policy distillation, 2026c.
Wang, Y., Yan, Z., Zhang, Y., Zhou, Q., Gu, Y., Wu, F., and Yang, H. Infigfusion: Graph-on-logits distillation via efficient Gromov-Wasserstein for model fusion. Advances in Neura...

[9] Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666.

[10] Instruction-Following Evaluation for Large Language Models
Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., and Hou, L. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

[11] Base reads and output writes are checkpoint-boundary costs, while planning and transactional overhead remain small compared with tensor streaming
I/O Breakdown and Overhead. Figure 5 shows that the gains come from reducing the expert-read term rather than from metadata effects. Base reads and output writes are checkpoint-boundary costs, while planning and transactional overhead remain small compared with tensor streaming. C. Limitations MergePipe targets budgeted weight-space access for check- p...
2025