
arxiv: 2605.27081 · v1 · pith:NE7W423Inew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · cs.DC

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Pith reviewed 2026-06-29 18:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC
keywords mixture of experts, router fine-tuning, expert reuse, memory-constrained inference, LLM inference, MoE optimization

The pith

Fine-tuning the MoE router to favor recent experts increases reuse by 26% without losing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts models activate only a few experts per token to save computation. In memory-limited inference, keeping only a small cache means many experts must be loaded from slow storage, hurting speed. ReMoE fine-tunes the router so it tends to pick experts that were chosen recently, making the sequence of choices more stable and better suited to the cache. This raises reuse rates and cuts fetches while task performance stays level. The gains appear in both throughput on server setups and speed on embedded hardware.

Core claim

ReMoE applies fine-tuning to the router to bias it toward recently selected experts. The resulting temporally stable routing decisions better match the locality needs of a small expert cache, lowering the frequency of loads from external storage during inference on models such as DeepSeek and Qwen.
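
To make the cache-locality argument concrete, here is a minimal sketch that counts expert fetches for two routing traces. It assumes an LRU eviction policy and a single MoE layer; the function name `count_expert_fetches` and both toy traces are hypothetical illustrations, not taken from the paper or its released code.

```python
from collections import OrderedDict


def count_expert_fetches(routing_trace, cache_capacity):
    """Count cache misses (loads from slow storage) for a routing trace under LRU.

    Assumption: LRU eviction and one shared cache for a single MoE layer.
    """
    cache = OrderedDict()  # expert id -> None, ordered least to most recently used
    fetches = 0
    for experts in routing_trace:
        for expert in experts:
            if expert in cache:
                cache.move_to_end(expert)  # hit: mark as most recently used
            else:
                fetches += 1  # miss: expert must be loaded from storage
                cache[expert] = None
                if len(cache) > cache_capacity:
                    cache.popitem(last=False)  # evict the least recently used expert
    return fetches


# Hypothetical toy traces: 4 decoding steps, Top-2 routing, 8 experts.
unstable_trace = [(0, 1), (2, 3), (4, 5), (6, 7)]  # disjoint experts each step
stable_trace = [(0, 1), (0, 1), (0, 2), (0, 2)]    # mostly reuses recent experts

print(count_expert_fetches(unstable_trace, cache_capacity=4))  # 8
print(count_expert_fetches(stable_trace, cache_capacity=4))    # 3
```

With the same cache size, the temporally stable trace needs 3 fetches where the disjoint trace needs 8, which is the effect the core claim relies on.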

What carries the argument

Router fine-tuning objective that encourages selection of recently used experts to promote temporal stability in routing.
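
The Figure 3 caption reproduced further down gives the overall shape of the objective: a cross-entropy term, a trust term anchored to a frozen reference gate, and a locality term. The sketch below shows one way such an objective could be assembled. The specific locality term (negative log of the probability mass on recently used experts), the KL direction, and the names `locality_loss`, `trust_kl`, `router_objective`, `lambda_kl`, and `alpha` are all assumptions for illustration; this page does not specify the paper's exact formulation.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def locality_loss(probs, recent_mask, eps=1e-9):
    """Hypothetical locality term: -log of router probability mass on recent experts."""
    recent_mass = (probs * recent_mask).sum(axis=-1)
    return float(-np.log(recent_mass + eps).mean())


def trust_kl(ref_probs, probs, eps=1e-9):
    """KL(reference || trainable), averaged over tokens (direction is an assumption)."""
    log_ratio = np.log(ref_probs + eps) - np.log(probs + eps)
    return float((ref_probs * log_ratio).sum(axis=-1).mean())


def router_objective(ce_loss, ref_logits, logits, recent_mask, lambda_kl=1.0, alpha=0.1):
    """Cross-entropy + weighted trust term + weighted locality term."""
    probs = softmax(logits)
    ref_probs = softmax(ref_logits)
    return (ce_loss
            + lambda_kl * trust_kl(ref_probs, probs)
            + alpha * locality_loss(probs, recent_mask))


# One token, four experts; experts 0 and 1 are in the (hypothetical) history buffer.
recent_mask = np.array([1.0, 1.0, 0.0, 0.0])
ref_logits = np.array([[1.0, 0.5, 1.0, 0.5]])       # frozen snapshot gate
stable_logits = np.array([[2.0, 1.5, 0.0, -0.5]])   # trainable gate favoring recent experts
print(router_objective(2.0, ref_logits, stable_logits, recent_mask))
```

In this sketch only the gate's logits enter the extra terms, so nothing is added to the forward pass at inference time, consistent with the claim that only gate parameters are updated.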

If this is right

  • Expert reuse improves by 26 percent across tested models.
  • Downstream task performance remains unchanged.
  • Output throughput rises 8.4 percent under vLLM with GPU-CPU offloading.
  • Time per output token (TPOT) during decode drops 43.6-49.8 percent under llama.cpp on Jetson Orin NX, a 1.77-1.99x decode speedup.
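
The two figures in the last bullet are consistent with each other if speedup is taken as baseline time per token divided by the new time per token. That definition is an assumption here, and `speedup_from_tpot_reduction` is a hypothetical helper name; the arithmetic is sketched below.

```python
def speedup_from_tpot_reduction(reduction_percent):
    """Speedup = old time / new time, where new time = old time * (1 - reduction)."""
    remaining_fraction = 1.0 - reduction_percent / 100.0
    return 1.0 / remaining_fraction


for reduction in (43.6, 49.8):
    speedup = speedup_from_tpot_reduction(reduction)
    print(f"{reduction}% TPOT reduction -> {speedup:.2f}x")
# 43.6% TPOT reduction -> 1.77x
# 49.8% TPOT reduction -> 1.99x
```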

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fine-tuning could be applied to other caching policies beyond simple LRU in MoE systems.
  • Edge deployments of large MoE models become more feasible if reuse gains hold across hardware.
  • The approach may extend to reducing communication costs in distributed MoE inference.

Load-bearing premise

Fine-tuning the router for short-horizon reuse leaves its selection quality intact on unseen or long inputs.

What would settle it

Measuring a decline in accuracy on held-out benchmarks or a failure to increase reuse on long-context prompts would disprove the central claim.
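
The reuse half of that test could be run along the following lines: compute adjacent-step expert overlap separately over consecutive segments of a long routing trace and check whether it holds up in later segments. The overlap definition, the function names, and the toy trace are assumptions for illustration; the paper's own reuse metric may be defined differently.

```python
def adjacent_reuse(routing_trace):
    """Mean fraction of each step's experts that were also selected at the previous step."""
    overlaps = [
        len(set(current) & set(previous)) / len(current)
        for previous, current in zip(routing_trace, routing_trace[1:])
    ]
    return sum(overlaps) / len(overlaps)


def reuse_by_segment(routing_trace, segment_length):
    """Adjacent-step reuse per consecutive segment.

    Pairs that straddle a segment boundary are not counted, and trailing
    segments with fewer than two steps are skipped.
    """
    segments = [
        routing_trace[start:start + segment_length]
        for start in range(0, len(routing_trace), segment_length)
    ]
    return [adjacent_reuse(segment) for segment in segments if len(segment) >= 2]


# Hypothetical trace: reuse is high early and collapses later in the sequence.
trace = [(0, 1), (0, 1), (0, 2), (3, 4), (5, 6), (7, 0)]
print(reuse_by_segment(trace, segment_length=3))  # [0.75, 0.0]
```

A fine-tuned router whose later segments fall back to baseline reuse levels on long prompts would be the kind of failure described above.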

Figures

Figures reproduced from arXiv: 2605.27081 by Liang Wang, Limin Xiao, Tianyang Jiang, Xiaojian Liao, Xiongwei Zhu, Yusen Zhang.

Figure 1: Comparison of expert I/O dynamics. Baseline: Standard routers select disjoint experts across steps, causing frequent cache replacements and I/O thrashing. ReMoE: Our method encourages temporal locality by increasing expert reuse across adjacent steps, thereby reducing cache turnover and I/O overhead.
… temporal locality loss that encourages expert reuse across adjacent tokens, and (ii) a Trust-KL loss tha…
Figure 2: Routing trajectories under teacher forcing (21st MoE layer of DeepSeek-V2-Lite). We trace Top-K expert indices over decoding steps for a fixed token sequence. The baseline already shows short stretches of potential reuse but exhibits frequent switching, while ReMoE, via gate-only fine-tuning, extends these streaks and stabilizes the working set, increasing short-horizon overlap. With inputs fixed, the differ…
Figure 3: Overview of ReMoE (per MoE layer). A frozen snapshot gate provides $P^{\mathrm{ref}}_t$ to anchor semantics, while a trainable gate is optimized with temporal-locality regularization using a short routing history buffer; only gate parameters are updated.
… cause it explicitly encourages expert dispersion, which conflicts with cache locality under memory constraints. Our full objective is: $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{Trust}} + \alpha_t \mathcal{L}_{\mathrm{Loc}}$, …
Figure 4: Routing trajectory of DeepSeek-V2-Lite-Chat under teacher forcing (layer 21). Each point marks one of the Top-K experts chosen at the corresponding decoding step. The chat-tuned model spreads its routing decisions across a wide set of experts and switches selection nearly every step, without the extended horizontal reuse streaks that ReMoE produces on the base model in …
Figure 5: Sensitivity to λreuse and λKL. Increasing λreuse improves reuse (EOR; reported as eval reuse) at the cost of larger trust deviation (eval trust kl), while increasing λKL reduces drift but also constrains reuse.

Metric                 Default Dmain   Short Dshort   Ratio (Short / Default)
PPL (eval ppl)         3.2280          3.2298         1.0006
Acc@1 (eval acc1)      71.775          71.781         1.0001
Reuse (eval reuse)     0.3453          0.3440         0.9963
Trust dev. (eval trust kl…
read the original abstract

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99$\times$ decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ReMoE, a router fine-tuning approach for Mixture-of-Experts LLMs that adds a bias toward recently used experts to increase token-wise reuse under memory constraints. It reports a 26% improvement in expert reuse on DeepSeek and Qwen models while preserving downstream task performance, along with measured gains of 8.4% throughput under vLLM GPU-CPU offloading and 1.77-1.99× decode speedup (via 43.6-49.8% TPOT reduction) on Jetson Orin NX with llama.cpp.

Significance. If the core assumption holds, the work addresses a practical bottleneck in deploying large MoE models on memory-limited hardware by improving cache locality without runtime overhead. The real-system results on two platforms and two model families add practical value, and the public release of checkpoints strengthens reproducibility. However, the absence of key methodological details and targeted ablations limits assessment of broader applicability.

major comments (3)
  1. [§3] §3 (method): The fine-tuning objective and bias mechanism toward recently selected experts are described only at a high level; the precise loss formulation, temporal horizon, and how the bias is added to the router logits without altering inference compute are unspecified. This is load-bearing for the 26% reuse claim, as it prevents determining whether the gain arises from the proposed mechanism or from training data characteristics.
  2. [§4-5] §4-5 (experiments): Downstream performance is reported as maintained on standard benchmarks, but no ablation or evaluation on long-context inputs or out-of-distribution tasks is described. The short-horizon reuse objective could override context-dependent signals on sequences exceeding the fine-tuning horizon, directly risking the central claim that performance is preserved under the same conditions where reuse gains are measured.
  3. [Abstract and §4] Abstract and §4: No information is provided on the fine-tuning dataset, hyperparameters, number of random seeds, or statistical significance testing for the 26% reuse figure. This makes it impossible to assess whether workload selection was post-hoc or whether the reported gains are robust, undermining confidence in the empirical results.
minor comments (2)
  1. [Abstract] Abstract: The relationship between the 43.6-49.8% TPOT reduction and the 1.77-1.99× decode speedup should be clarified explicitly, including any assumptions about batching or sequence length.
  2. The manuscript would benefit from additional citations to prior work on cache-aware MoE inference and temporal routing stability to better situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each of the major comments below, agreeing to provide additional details and ablations in a revised manuscript where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (method): The fine-tuning objective and bias mechanism toward recently selected experts are described only at a high level; the precise loss formulation, temporal horizon, and how the bias is added to the router logits without altering inference compute are unspecified. This is load-bearing for the 26% reuse claim, as it prevents determining whether the gain arises from the proposed mechanism or from training data characteristics.

    Authors: We acknowledge that the description in the manuscript is high-level. We will revise §3 to provide the precise loss formulation, specify the temporal horizon, and detail how the bias is incorporated into the router logits during fine-tuning only, ensuring no additional inference compute. This revision will make it clear that the 26% reuse improvement stems from the proposed bias mechanism. revision: yes

  2. Referee: [§4-5] §4-5 (experiments): Downstream performance is reported as maintained on standard benchmarks, but no ablation or evaluation on long-context inputs or out-of-distribution tasks is described. The short-horizon reuse objective could override context-dependent signals on sequences exceeding the fine-tuning horizon, directly risking the central claim that performance is preserved under the same conditions where reuse gains are measured.

    Authors: We note that the current experiments focus on standard benchmarks to demonstrate maintained performance. However, to directly address the potential issue with long-context inputs, we will add ablations and evaluations on long-context tasks in the revised version to confirm that the short-horizon objective does not degrade performance on sequences longer than the fine-tuning horizon. revision: yes

  3. Referee: [Abstract and §4] Abstract and §4: No information is provided on the fine-tuning dataset, hyperparameters, number of random seeds, or statistical significance testing for the 26% reuse figure. This makes it impossible to assess whether workload selection was post-hoc or whether the reported gains are robust, undermining confidence in the empirical results.

    Authors: We agree that the manuscript should include more details on the experimental setup. We will update the abstract and §4 to specify the fine-tuning dataset, list the hyperparameters, report the number of random seeds, and include statistical significance testing for the reported improvements. The public release of checkpoints already aids reproducibility, but these additions will further strengthen the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains from router fine-tuning are measured outcomes, not identities by construction

full rationale

The paper proposes ReMoE as a fine-tuning method that adds a bias toward recently-used experts to improve cache locality in memory-constrained MoE inference. The 26% reuse improvement, 8.4% throughput gain, and 1.77-1.99x speedup are reported as direct experimental measurements on DeepSeek/Qwen models and real hardware (vLLM, llama.cpp on Jetson), not as quantities algebraically forced by the fine-tuning objective or by any self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that would reduce the central claims to the inputs; the method and its evaluation remain independent of the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that a modest bias toward recent experts can be introduced via fine-tuning without harming long-term routing quality; no new entities or free parameters beyond standard fine-tuning hyperparameters are introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5792 in / 1125 out tokens · 21038 ms · 2026-06-29T18:56:01.951853+00:00 · methodology

