arxiv: 2604.20156 · v1 · submitted 2026-04-22 · 💻 cs.LG


Temporally Extended Mixture-of-Experts Models

Zeyu Shen, Peter Henderson


Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords mixture of experts · option-critic framework · deliberation costs · temporal extension · low-rank adapters · self-distillation · expert switching · model serving


The pith

Mixture-of-experts models can be made to switch experts only rarely by framing expert selection as options in reinforcement learning, trained with deliberation costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard MoE layers switch experts at nearly every token, which prevents useful memory optimizations once models exceed GPU capacity. It proposes adding a per-layer controller trained under the option-critic framework so that entire sets of experts stay active for extended periods. When applied to an existing 20-billion-parameter model via low-rank adapters and a self-distillation reward, the approach lowers average switch rates from over 50 percent to below 5 percent while keeping up to 90 percent of the original accuracy on MATH, MMLU, and MMMLU. A reader would care because the change turns high-churn MoEs into memory-efficient ones without requiring full retraining or original data.

Core claim

By applying the option-critic framework with deliberation costs to gpt-oss-20b using low-rank adapters and self-distillation, the method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability.

What carries the argument

The option-critic framework with deliberation costs, which trains a controller per MoE layer to select temporally extended expert sets rather than token-by-token choices.
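
As a rough illustration of how a switching cost produces temporal extension, the sketch below uses a greedy thresholded rule rather than the paper's learned option-critic controller. The function name `select_expert_sets`, the per-token value table `values`, and the threshold `eta` are all hypothetical; `eta` only mimics the role of a deliberation cost by requiring an alternative expert set to beat the active one by a margin before a switch happens.

```python
from typing import List, Sequence


def select_expert_sets(values: Sequence[Sequence[float]], eta: float) -> List[int]:
    """Pick one expert-set index per token, switching only when it pays off.

    values[t][k] is a hypothetical value estimate for using expert set k at
    token t. eta stands in for a deliberation cost: the active set is kept
    unless the best alternative beats it by more than eta.
    """
    if not values:
        return []
    # Start with the best-scoring set for the first token.
    active = max(range(len(values[0])), key=lambda k: values[0][k])
    trace = [active]
    for row in values[1:]:
        best = max(range(len(row)), key=lambda k: row[k])
        # Switch only if the gain exceeds the cost of switching.
        if row[best] - row[active] > eta:
            active = best
        trace.append(active)
    return trace


# Toy value table: 5 tokens, 2 candidate expert sets (made-up numbers).
toy_values = [[1.0, 0.0], [0.4, 0.6], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

print(select_expert_sets(toy_values, eta=0.0))  # [0, 1, 0, 1, 1]
print(select_expert_sets(toy_values, eta=0.5))  # [0, 0, 0, 1, 1]
print(select_expert_sets(toy_values, eta=1.0))  # [0, 0, 0, 0, 0]
```

In this toy table, raising `eta` from 0 to 0.5 cuts the number of switches from three to one, which is the kind of trade-off the deliberation cost is described as exposing.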

If this is right

Memory optimizations such as expert offloading and prefetching become practical for models larger than single-GPU memory.
Model trainers gain an explicit knob, the deliberation cost, to trade lower switching frequency against task performance.
Existing pre-trained MoE checkpoints can be converted to the temporally extended form using only low-rank adapters and a distillation reward.
The same per-layer controller structure could support continual learning, for example by updating only the switching policy as new data arrives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other sparse architectures where activation patterns are currently recomputed every token.
Lower switch rates could reduce communication overhead in multi-node inference setups without changing the underlying experts.
Controllers learned this way might serve as a starting point for further specialization on narrow domains.

Load-bearing premise

A lightweight controller trained with self-distillation on frozen base weights can learn stable low-frequency switching policies without full retraining or original data access.

What would settle it

Run the trained controller on MATH, MMLU, and MMMLU and measure the realized expert switch rate; if the average remains above 5 percent while accuracy stays near the reported level, the reduction claim does not hold.
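
A minimal sketch of the measurement described above, under an assumed definition: a switch is counted whenever the active expert set at a token differs from the set at the previous token in the same layer. The paper's exact definition of switch rate is not given in this summary, and the names `switch_rate` and `mean_switch_rate` are hypothetical.

```python
from typing import Iterable, List, Sequence


def switch_rate(expert_sets_per_token: Sequence[Iterable[int]]) -> float:
    """Fraction of token transitions at which the active expert set changes."""
    # frozenset makes the comparison independent of expert ordering.
    sets = [frozenset(s) for s in expert_sets_per_token]
    if len(sets) < 2:
        return 0.0
    switches = sum(1 for prev, cur in zip(sets, sets[1:]) if prev != cur)
    return switches / (len(sets) - 1)


def mean_switch_rate(per_layer: Sequence[Sequence[Iterable[int]]]) -> float:
    """Average the per-layer switch rates (assumed aggregation)."""
    rates: List[float] = [switch_rate(layer) for layer in per_layer]
    return sum(rates) / len(rates) if rates else 0.0


# Made-up routing traces for two layers over five tokens.
layer_a = [{0, 1}, {0, 1}, {2, 3}, {2, 3}, {2, 3}]
layer_b = [{0, 1}, {1, 0}, {0, 1}, {0, 1}, {0, 1}]

print(switch_rate(layer_a))                   # 0.25
print(switch_rate(layer_b))                   # 0.0
print(mean_switch_rate([layer_a, layer_b]))   # 0.125
```

Comparing the resulting average against the 5 percent threshold, alongside benchmark accuracy, is the check the paragraph above proposes.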

Original abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ports the option-critic framework to MoE routing with per-layer controllers and self-distillation, cutting switch rates sharply on a frozen 20B model while keeping most accuracy, but the gains may come from expert collapse rather than genuine temporal options.


The main point is that they train a lightweight controller at each layer to decide when to switch expert sets and which set to load, using a deliberation cost to discourage frequent changes. On gpt-oss-20b with LoRA adapters and a self-distillation reward from the frozen base, they report switch rates dropping from over 50% to below 5%, with up to 90% of base-model accuracy retained on MATH, MMLU, and MMMLU. This is a direct transfer of the options idea to MoE, and the per-layer design plus the cost term are the concrete engineering additions that make it work on existing models without full retraining.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes temporally extended Mixture-of-Experts (MoE) models by integrating the option-critic framework from reinforcement learning. A per-layer controller is added to decide when to switch expert sets; it is trained with low-rank adapters (LoRA) and a self-distillation reward on the frozen gpt-oss-20b weights. The central empirical claim is that this reduces expert switch rates from >50% to <5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU, with deliberation costs providing an explicit trade-off between switching frequency and capability.

Significance. If the results are robust, the work offers a principled, lightweight route to memory-efficient inference for large MoE models by reducing expert churn and enabling better offloading/prefetching. The direct use of the options framework to impose temporal structure on MoE routing is a novel connection, and the ability to retrofit pre-trained models without full retraining is practically valuable for continual learning. The deliberation-cost mechanism supplies a clean hyperparameter for practitioners.

major comments (3)

[Abstract and Experimental Results] The abstract and results section report switch rates <5% and 90% accuracy retention, yet supply no variance across runs, no comparison to simple baselines (e.g., fixed-expert or random-switching policies), and no ablation removing the deliberation cost. These omissions make it impossible to determine whether the option-critic controller, rather than capacity reduction, drives the reported gains.
[Training Procedure] The self-distillation reward matches only the final model outputs and supplies no direct supervision on per-token expert activation. Consequently, nothing in the training objective prevents the controller from achieving low switch rates by repeatedly selecting a small dominant expert subset (as noted in the stress-test concern). A diagnostic measuring expert-set diversity or utilization entropy across tokens is required to confirm that temporally extended options, rather than collapse, are learned.
[Method] The adaptation of the option-critic framework (intra-option policies, termination functions, and deliberation costs) to MoE layers is described at a high level but lacks the explicit loss or policy-gradient expressions used for the controller. Without these, it is difficult to verify that the temporal-extension mechanism is correctly instantiated and that the reported behavior follows from the framework rather than from the LoRA + distillation setup alone.
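
The utilization-entropy diagnostic requested in the Training Procedure comment could be computed along the following lines. This is a sketch under assumptions: it takes a flat list of expert indices activated across tokens for one layer and reports Shannon entropy normalized by the log of the expert count, so a value near 0 suggests collapse onto a few experts and a value near 1 suggests even use. The function name `utilization_entropy` and the normalization are illustrative choices, not the paper's.

```python
import math
from collections import Counter
from typing import Iterable


def utilization_entropy(expert_ids: Iterable[int], num_experts: int) -> float:
    """Normalized Shannon entropy of expert usage, in the range [0, 1]."""
    if num_experts < 2:
        raise ValueError("num_experts must be at least 2")
    counts = Counter(expert_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log(1/p) over experts that were used at least once.
    entropy = sum(
        (c / total) * math.log(total / c) for c in counts.values()
    )
    return entropy / math.log(num_experts)


# Made-up activation logs for a layer with 4 experts.
even_usage = [0, 1, 2, 3] * 5   # every expert used equally often
collapsed = [0] * 20            # a single expert used for every token

print(utilization_entropy(even_usage, num_experts=4))  # close to 1.0
print(utilization_entropy(collapsed, num_experts=4))   # 0.0
```

Reporting this value per layer for both the base router and the trained controller would show whether a low switch rate coincides with a drop in expert diversity.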

minor comments (1)

[Method] Notation for the controller's state and action spaces is introduced without a clear diagram or table relating them to the underlying MoE routing variables.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We have carefully considered each comment and made revisions to address the concerns about experimental rigor, training diagnostics, and methodological clarity.


Referee: [Abstract and Experimental Results] The abstract and results section report switch rates <5% and 90% accuracy retention, yet supply no variance across runs, no comparison to simple baselines (e.g., fixed-expert or random-switching policies), and no ablation removing the deliberation cost. These omissions make it impossible to determine whether the option-critic controller, rather than capacity reduction, drives the reported gains.

Authors: We agree with the referee that reporting variance, including baseline comparisons, and performing an ablation on the deliberation cost would strengthen the claims. In the revised version, we now report results with standard deviations over multiple random seeds, compare against fixed-expert and random-switching policies (showing superior switch rate reduction without proportional accuracy loss), and include an ablation study demonstrating that removing deliberation costs leads to higher switching rates. These additions confirm that the option-critic framework contributes to the observed behavior beyond mere capacity reduction.
Revision: yes
Referee: [Training Procedure] The self-distillation reward matches only the final model outputs and supplies no direct supervision on per-token expert activation. Consequently, nothing in the training objective prevents the controller from achieving low switch rates by repeatedly selecting a small dominant expert subset (as noted in the stress-test concern). A diagnostic measuring expert-set diversity or utilization entropy across tokens is required to confirm that temporally extended options, rather than collapse, are learned.

Authors: The concern about potential collapse to a dominant expert subset is well-taken, as the self-distillation objective focuses on output matching. To mitigate this and provide evidence against collapse, we have added diagnostics in the revised manuscript, including per-layer expert utilization entropy and diversity measures across tokens. These metrics indicate that the controller learns temporally extended options with diverse expert sets rather than collapsing, maintaining entropy levels comparable to the base model. We also reference the stress-test results to show robustness.
Revision: yes
Referee: [Method] The adaptation of the option-critic framework (intra-option policies, termination functions, and deliberation costs) to MoE layers is described at a high level but lacks the explicit loss or policy-gradient expressions used for the controller. Without these, it is difficult to verify that the temporal-extension mechanism is correctly instantiated and that the reported behavior follows from the framework rather than from the LoRA + distillation setup alone.

Authors: We appreciate the need for explicit formulations to allow verification. The revised method section now includes the detailed loss functions and policy-gradient expressions for the controller, specifically the gradients for the termination function, intra-option policy updates, and the integration of deliberation costs into the advantage estimation. This makes clear how the option-critic framework is adapted to MoE routing and separates its contribution from the LoRA adapters and self-distillation.
Revision: yes
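
For orientation, the termination update in the option-critic literature that the paper builds on is usually written in the form below. This is the generic textbook form, given here as an assumption about what the requested expressions would resemble; this review does not show the paper's own equations. The symbols are the usual ones: $\beta_{\omega,\vartheta}$ is the termination function of option $\omega$ with parameters $\vartheta$, $Q_\Omega$ and $V_\Omega$ are the option-value and state-value functions, $\alpha_\vartheta$ is a step size, and $\eta$ is the deliberation cost.

```latex
\vartheta \leftarrow \vartheta - \alpha_\vartheta \,
  \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}
  \bigl( A_\Omega(s', \omega) + \eta \bigr),
\qquad
A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s').
```

With $\eta > 0$, the termination probability is pushed down unless the current option is worse than the state value by more than $\eta$, which is how the cost term would lengthen the time an expert set stays active.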

Circularity Check

0 steps flagged

No circularity; empirical application of external RL framework


The paper presents an engineering application of the established option-critic framework (with deliberation costs) to MoE layers, using LoRA adapters and a self-distillation reward defined against the external base model. Reported outcomes (switch-rate reduction to <5% and up to 90% accuracy retention on MATH/MMLU/MMMLU) are experimental measurements, not quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are shown that reduce the central claims to the inputs; the self-distillation signal and deliberation-cost trade-off are independent of the target metrics. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that expert sets can be treated as temporally extended options whose value can be learned via a small controller without destabilizing the base model. No new physical or mathematical entities are introduced.

axioms (1)

Domain assumption: The option-critic framework with deliberation costs can be applied to per-layer expert routing without changing the underlying transformer architecture.
Invoked when the authors state that the options framework is a perfect match for the switching problem.

pith-pipeline@v0.9.0 · 5495 in / 1478 out tokens · 19822 ms · 2026-05-10T00:37:05.235994+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
cs.LG 2026-05 conditional novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant...
EMO: Pretraining Mixture of Experts for Emergent Modularity
cs.CL 2026-05 conditional novelty 6.0

EMO uses document-boundary expert pooling during pretraining to induce emergent semantic modularity in MoE models, allowing 25% expert retention with only 1% performance drop.
EMO: Pretraining Mixture of Experts for Emergent Modularity
cs.CL 2026-05 unverdicted novelty 6.0

EMO pretrains MoEs using document boundaries to induce semantic expert specialization, enabling modular subset deployment with minimal accuracy loss unlike standard MoEs.
