Temporally Extended Mixture-of-Experts Models
Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3
The pith
Mixture-of-experts models can switch experts only rarely by framing selection as options in reinforcement learning with deliberation costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying the option-critic framework with deliberation costs to gpt-oss-20b using low-rank adapters and self-distillation, the method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability.
What carries the argument
The option-critic framework with deliberation costs, which trains a controller per MoE layer to select temporally extended expert sets rather than token-by-token choices.
If this is right
- Memory optimizations such as expert offloading and prefetching become practical for models larger than single-GPU memory.
- Model trainers gain an explicit knob, the deliberation cost, to trade lower switching frequency against task performance.
- Existing pre-trained MoE checkpoints can be converted to the temporally extended form using only low-rank adapters and a distillation reward.
- The same per-layer controller structure supports continual learning by updating only the switching policy as new data arrives.
Where Pith is reading between the lines
- The approach may extend to other sparse architectures where activation patterns are currently recomputed every token.
- Lower switch rates could reduce communication overhead in multi-node inference setups without changing the underlying experts.
- Controllers learned this way might serve as a starting point for further specialization on narrow domains.
Load-bearing premise
A lightweight controller trained with self-distillation on frozen base weights can learn stable low-frequency switching policies without full retraining or original data access.
What would settle it
Run the trained controller on MATH, MMLU, and MMMLU and measure the realized expert switch rate; if the average remains above 5 percent while accuracy stays near the reported level, the reduction claim does not hold.
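The check above depends on how "switch rate" is counted, and this page does not give the paper's exact definition. The following is a minimal sketch under one assumed definition: the fraction of consecutive token positions at which a layer's active expert set differs from the previous token's. The function names `switch_rate` and `mean_switch_rate` are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Iterable, Sequence


def switch_rate(expert_sets: Sequence[Iterable[Hashable]]) -> float:
    """Fraction of token-to-token transitions where the active expert set changes.

    Assumed definition (not confirmed by the paper): a switch happens whenever
    the set of active experts at token t differs from the set at token t - 1.
    """
    sets = [frozenset(s) for s in expert_sets]  # order within a set is ignored
    if len(sets) < 2:
        return 0.0  # no transitions to count
    switches = sum(prev != cur for prev, cur in zip(sets, sets[1:]))
    return switches / (len(sets) - 1)


def mean_switch_rate(per_layer: Sequence[Sequence[Iterable[Hashable]]]) -> float:
    """Average the per-layer switch rates (one trace of expert sets per MoE layer)."""
    rates = [switch_rate(layer) for layer in per_layer]
    return sum(rates) / len(rates) if rates else 0.0


# One layer, five tokens, a single change of expert set: 1 switch in 4 transitions.
trace = [{0, 1}, {0, 1}, {2, 3}, {2, 3}, {2, 3}]
print(switch_rate(trace))  # 0.25
```

Under this definition, the settling test would compare `mean_switch_rate` over the benchmark traces against the 5 percent threshold.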
Original abstract
Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes temporally extended Mixture-of-Experts (MoE) models by integrating the option-critic framework from reinforcement learning. A per-layer controller is added to decide when to switch expert sets; it is trained with low-rank adapters (LoRA) and a self-distillation reward on the frozen gpt-oss-20b weights. The central empirical claim is that this reduces expert switch rates from >50% to <5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU, with deliberation costs providing an explicit trade-off between switching frequency and capability.
Significance. If the results are robust, the work offers a principled, lightweight route to memory-efficient inference for large MoE models by reducing expert churn and enabling better offloading/prefetching. The direct use of the options framework to impose temporal structure on MoE routing is a novel connection, and the ability to retrofit pre-trained models without full retraining is practically valuable for continual learning. The deliberation-cost mechanism supplies a clean hyperparameter for practitioners.
major comments (3)
- [Abstract and Experimental Results] The abstract and results section report switch rates <5% and 90% accuracy retention, yet supply no variance across runs, no comparison to simple baselines (e.g., fixed-expert or random-switching policies), and no ablation removing the deliberation cost. These omissions make it impossible to determine whether the option-critic controller, rather than capacity reduction, drives the reported gains.
- [Training Procedure] The self-distillation reward matches only the final model outputs and supplies no direct supervision on per-token expert activation. Consequently, nothing in the training objective prevents the controller from achieving low switch rates by repeatedly selecting a small dominant expert subset (as noted in the stress-test concern). A diagnostic measuring expert-set diversity or utilization entropy across tokens is required to confirm that temporally extended options, rather than collapse, are learned.
- [Method] The adaptation of the option-critic framework (intra-option policies, termination functions, and deliberation costs) to MoE layers is described at a high level but lacks the explicit loss or policy-gradient expressions used for the controller. Without these, it is difficult to verify that the temporal-extension mechanism is correctly instantiated and that the reported behavior follows from the framework rather than from the LoRA + distillation setup alone.
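The diversity diagnostic requested in the Training Procedure comment could take several forms. As one illustration, the sketch below computes a normalized Shannon entropy of expert usage over a token trace. The function name `utilization_entropy` and the normalization by `log(num_experts)` are assumptions made here, not the authors' metric.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def utilization_entropy(
    expert_sets: Sequence[Iterable[Hashable]], num_experts: int
) -> float:
    """Normalized Shannon entropy of expert usage across tokens, in [0, 1].

    A value near 0 means a few experts handle almost every token (collapse);
    a value near 1 means usage is spread evenly over all `num_experts`.
    """
    # Count how many tokens activate each expert (each expert once per token).
    counts = Counter(e for s in expert_sets for e in set(s))
    total = sum(counts.values())
    if total == 0 or num_experts < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_experts)


# A controller that always loads experts {0, 1} out of 4 uses half the
# available entropy; one that alternates between {0, 1} and {2, 3} uses all of it.
collapsed = utilization_entropy([{0, 1}] * 10, num_experts=4)
diverse = utilization_entropy([{0, 1}, {2, 3}] * 5, num_experts=4)
print(collapsed < diverse)  # True
```

A low switch rate paired with entropy far below the base model's would point toward collapse onto a dominant subset rather than learned temporally extended options.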
minor comments (1)
- [Method] Notation for the controller's state and action spaces is introduced without a clear diagram or table relating them to the underlying MoE routing variables.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We have carefully considered each comment and made revisions to address the concerns about experimental rigor, training diagnostics, and methodological clarity.
Point-by-point responses
- Referee: [Abstract and Experimental Results] The abstract and results section report switch rates <5% and 90% accuracy retention, yet supply no variance across runs, no comparison to simple baselines (e.g., fixed-expert or random-switching policies), and no ablation removing the deliberation cost. These omissions make it impossible to determine whether the option-critic controller, rather than capacity reduction, drives the reported gains.
  Authors: We agree with the referee that reporting variance, including baseline comparisons, and performing an ablation on the deliberation cost would strengthen the claims. In the revised version, we now report results with standard deviations over multiple random seeds, compare against fixed-expert and random-switching policies (showing superior switch rate reduction without proportional accuracy loss), and include an ablation study demonstrating that removing deliberation costs leads to higher switching rates. These additions confirm that the option-critic framework contributes to the observed behavior beyond mere capacity reduction.
  Revision: yes
- Referee: [Training Procedure] The self-distillation reward matches only the final model outputs and supplies no direct supervision on per-token expert activation. Consequently, nothing in the training objective prevents the controller from achieving low switch rates by repeatedly selecting a small dominant expert subset (as noted in the stress-test concern). A diagnostic measuring expert-set diversity or utilization entropy across tokens is required to confirm that temporally extended options, rather than collapse, are learned.
  Authors: The concern about potential collapse to a dominant expert subset is well-taken, as the self-distillation objective focuses on output matching. To mitigate this and provide evidence against collapse, we have added diagnostics in the revised manuscript, including per-layer expert utilization entropy and diversity measures across tokens. These metrics indicate that the controller learns temporally extended options with diverse expert sets rather than collapsing, maintaining entropy levels comparable to the base model. We also reference the stress-test results to show robustness.
  Revision: yes
- Referee: [Method] The adaptation of the option-critic framework (intra-option policies, termination functions, and deliberation costs) to MoE layers is described at a high level but lacks the explicit loss or policy-gradient expressions used for the controller. Without these, it is difficult to verify that the temporal-extension mechanism is correctly instantiated and that the reported behavior follows from the framework rather than from the LoRA + distillation setup alone.
  Authors: We appreciate the need for explicit formulations to allow verification. The revised method section now includes the detailed loss functions and policy-gradient expressions for the controller, specifically the gradients for the termination function, intra-option policy updates, and the integration of deliberation costs into the advantage estimation. This makes clear how the option-critic framework is adapted to MoE routing and separates its contribution from the LoRA adapters and self-distillation.
  Revision: yes
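For orientation, the kind of expression the referee asks for can be illustrated with the standard termination update from the option-critic and deliberation-cost literature (Bacon et al.; Harb et al.). This is the generic textbook form, not the paper's MoE-specific loss, which this page does not reproduce. Reading the option \omega as the currently loaded expert set and s' as the controller's state at the next token is an assumption made here for illustration.

```latex
% Generic option-critic termination update with deliberation cost \eta.
% \beta_{\omega,\vartheta}: termination probability of option \omega, parameters \vartheta
% A_\Omega: advantage of continuing with \omega over re-selecting an option
\[
  \vartheta \leftarrow \vartheta
  - \alpha_\vartheta \,
    \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}
    \bigl( A_\Omega(s', \omega) + \eta \bigr),
  \qquad
  A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s').
\]
```

With \eta > 0, the termination probability rises only when the current option's value falls below V_\Omega(s') by more than \eta. This is how a larger deliberation cost yields longer-lived options and, in the MoE reading, fewer expert switches.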
Circularity Check
No circularity; empirical application of external RL framework
full rationale
The paper presents an engineering application of the established option-critic framework (with deliberation costs) to MoE layers, using LoRA adapters and a self-distillation reward defined against the external base model. Reported outcomes (switch-rate reduction to <5% and up to 90% accuracy retention on MATH/MMLU/MMMLU) are experimental measurements, not quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are shown that reduce the central claims to the inputs; the self-distillation signal and deliberation-cost trade-off are independent of the target metrics. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The option-critic framework with deliberation costs can be applied to per-layer expert routing without changing the underlying transformer architecture.
Forward citations
Cited by 3 Pith papers
- Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
  Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
- EMO: Pretraining Mixture of Experts for Emergent Modularity
  EMO uses document-boundary expert pooling during pretraining to induce emergent semantic modularity in MoE models, allowing 25% expert retention with only 1% performance drop.
- EMO: Pretraining Mixture of Experts for Emergent Modularity
  EMO pretrains MoEs using document boundaries to induce semantic expert specialization, enabling modular subset deployment with minimal accuracy loss unlike standard MoEs.