ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

Peng Cheng; Sangjin Choi; Sukmin Cho; Yifan Xiong; Youngjin Kwon; Ziyue Yang

arxiv: 2607.00466 · v1 · pith:HAHOFQKPnew · submitted 2026-07-01 · 💻 cs.DC


Sangjin Choi, Sukmin Cho, Yifan Xiong, Ziyue Yang, Youngjin Kwon, Peng Cheng

Pith reviewed 2026-07-02 06:38 UTC · model grok-4.3

classification 💻 cs.DC

keywords MoE serving · decode routing · expert locality · PD disaggregation · load balancing · LLM inference · mixture of experts · time per output token


The pith

Routing decode requests by predicted expert activations cuts median TPOT 5.9-13.9% in MoE PD-disaggregated serving while keeping outputs identical.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In prefill-decode disaggregated serving, load balancing alone fails to minimize latency for mixture-of-experts models because each decode step must fetch weights for every distinct expert activated by the current batch. The paper establishes that an expert signature computed from prefill activations can predict the experts needed during generation for a given request. Offline balanced K-means then divides the space of signatures across decode workers, and online locality-band routing sends each request to the least-loaded worker whose partition matches the signature. A cache stores signatures at KV-block granularity to preserve exactness under prefix caching. This method is shown to deliver the reported latency gains across three models and two workloads without altering generated outputs.
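The cost asymmetry described above can be illustrated with a minimal sketch. It assumes a single MoE layer and represents each request by a hypothetical set of expert IDs it activates in one decode step; the function name `active_experts` and the toy expert IDs are illustrative and do not come from the paper.

```python
from typing import Iterable, Set


def active_experts(batch: Iterable[Set[int]]) -> Set[int]:
    """Union of expert IDs activated by all requests in one decode batch."""
    active: Set[int] = set()
    for experts in batch:
        active |= experts
    return active


# Two decode workers with equal load (two requests each).
worker_a = [{0, 1}, {0, 1}]  # requests share the same experts
worker_b = [{0, 1}, {6, 7}]  # requests use disjoint experts

# Equal request counts, but worker_b must touch twice as many expert weights.
print(len(active_experts(worker_a)), len(active_experts(worker_b)))
```

Under this toy model a load-only router sees the two workers as equivalent, which is the gap the paper targets.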

Core claim

ELDR builds an expert signature from a request's prefill expert activations to forecast the experts it will activate during decode. Balanced K-means partitions the signature space across decode workers offline. Online locality-band routing directs each request to the least-loaded worker among those whose partition best matches its signature. A signature cache co-indexed with the KV cache maintains precision when prefix caching is used. Across three MoE models and two workloads the approach reduces median time-per-output-token by 5.9-13.9 percent relative to the strongest load-balancing baselines while producing unchanged model outputs.
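One way to read "locality-band routing" is sketched below. This is an interpretation, not the paper's implementation: it assumes similarity is cosine similarity to each worker's centroid, and that the band contains every worker whose similarity is within `tau` of the best match. The names `route`, `cosine`, and `tau` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def route(signature: Sequence[float],
          centroids: List[Sequence[float]],
          loads: Sequence[int],
          tau: float = 0.1) -> int:
    """Pick the least-loaded worker among those within `tau` of the best match."""
    sims = [cosine(signature, c) for c in centroids]
    best = max(sims)
    band = [w for w, s in enumerate(sims) if s >= best - tau]
    # Ties on load are broken in favor of the closer centroid.
    return min(band, key=lambda w: (loads[w], -sims[w]))


centroids = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
loads = [5, 2, 0]
signature = [1.0, 0.0, 0.0]
print(route(signature, centroids, loads, tau=0.1))  # band holds workers 0 and 1
print(route(signature, centroids, loads, tau=0.0))  # band holds only the best match
```

With `tau=0` the rule degenerates to pure nearest-centroid routing, and with a very wide band it degenerates to least-loaded routing, so the band width trades locality against balance.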

What carries the argument

The expert signature from prefill activations, combined with offline balanced K-means partitioning of signature space and online locality-band routing that selects the least-loaded matching worker.
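A signature of this kind could be built roughly as follows. The sketch assumes a single layer, top-k expert IDs per prefill token as input, and an optional per-expert IDF-style weight vector, loosely following the "count·idf" transform named in the figure captions. The function name, the L2 normalization, and the input layout are assumptions rather than the paper's recipe.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence


def expert_signature(prefill_topk: Sequence[Sequence[int]],
                     num_experts: int,
                     idf: Optional[Sequence[float]] = None) -> List[float]:
    """Count how often each expert is selected during prefill, reweight, normalize."""
    counts = Counter(e for token_experts in prefill_topk for e in token_experts)
    weights = idf if idf is not None else [1.0] * num_experts
    vec = [counts.get(e, 0) * weights[e] for e in range(num_experts)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Three prefill tokens, top-2 routing, four experts (toy values).
sig = expert_signature([[0, 1], [0, 2], [0, 1]], num_experts=4)
print(sig)
```

The resulting vector is what the clustering and routing steps would consume; a real system would also need to decide which layers contribute.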

If this is right

Decode routing decisions can be improved by incorporating predicted expert locality in addition to instantaneous load.
Tying the signature cache to KV-block granularity preserves routing accuracy under prefix caching.
The same routing logic leaves model outputs unchanged, so correctness is unaffected.
The measured gains hold for deployments scaling to 40 GPUs and across multiple MoE architectures.
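The KV-block-granular signature cache mentioned in the list above can be pictured with a small sketch. It assumes each KV block contributes an additive vector of per-expert activation counts, keyed by the same block identifier the prefix cache uses. `SignatureCache`, `request_counts`, and the block keys are hypothetical names for illustration only.

```python
from typing import Dict, Hashable, List, Mapping, Sequence


class SignatureCache:
    """Per-block expert activation counts, keyed like KV-cache blocks (assumed)."""

    def __init__(self, num_experts: int) -> None:
        self.num_experts = num_experts
        self._blocks: Dict[Hashable, List[int]] = {}

    def put(self, block_key: Hashable, counts: Sequence[int]) -> None:
        self._blocks[block_key] = list(counts)

    def get(self, block_key: Hashable) -> List[int]:
        return self._blocks[block_key]  # KeyError if the block was never computed


def request_counts(block_keys: Sequence[Hashable],
                   computed: Mapping[Hashable, Sequence[int]],
                   cache: SignatureCache) -> List[int]:
    """Sum block-level counts: fresh ones from this prefill, the rest from cache."""
    total = [0] * cache.num_experts
    for key in block_keys:
        if key in computed:          # block was actually prefilled this time
            cache.put(key, computed[key])
        for expert, count in enumerate(cache.get(key)):
            total[expert] += count
    return total


cache = SignatureCache(num_experts=3)
# First request misses the prefix cache and computes both blocks.
first = request_counts(["b0", "b1"], {"b0": [2, 0, 1], "b1": [0, 3, 0]}, cache)
# Second request hits on block b0 and computes only b2.
second = request_counts(["b0", "b2"], {"b2": [1, 0, 0]}, cache)
print(first, second)
```

The point of the design is that a partial prefix hit still yields the same counts as a full recomputation would, because the skipped blocks' contributions are recovered from the cache.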

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If signatures remain stable across turns, the method could reduce the frequency of expert weight movement between workers.
The same signature-based partitioning idea might apply to other serving disaggregation boundaries where activation patterns are request-specific.
Dynamic re-partitioning of signature space could be tested if workload expert distributions drift over time.

Load-bearing premise

The set of experts activated during prefill for a request reliably predicts the experts that will be activated during its decode phase.

What would settle it

A direct comparison, for a large set of requests, showing that the experts actually activated during decode frequently differ from the signature predicted from prefill activations.
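A comparison of this kind could use a set-overlap score per request. The sketch below uses Jaccard similarity between the experts implied by the prefill signature and the experts observed during decode; the function names and the toy expert sets are illustrative assumptions, not measurements from the paper.

```python
from typing import Iterable, Set, Tuple


def jaccard(predicted: Set[int], observed: Set[int]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    union = predicted | observed
    if not union:
        return 1.0
    return len(predicted & observed) / len(union)


def mean_overlap(pairs: Iterable[Tuple[Set[int], Set[int]]]) -> float:
    """Average Jaccard overlap over (predicted, observed) pairs."""
    scores = [jaccard(p, o) for p, o in pairs]
    return sum(scores) / len(scores)


# Toy (prefill-predicted, decode-observed) expert sets for two requests.
pairs = [({0, 1, 2, 3}, {2, 3, 4}), ({5, 6}, {5, 6})]
print(mean_overlap(pairs))
```

Consistently low scores across a large request sample would undercut the premise; high scores would support it.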

Figures

Figures reproduced from arXiv: 2607.00466 by Peng Cheng, Sangjin Choi, Sukmin Cho, Yifan Xiong, Youngjin Kwon, Ziyue Yang.

**Figure 1.** Decode-phase per-expert activation relative to the cross-domain mean, for three MoE models along task (top) and language (bottom, WildChat [37]) domains at each model’s most discriminative layer. Darker is above-average (below-average clipped to white); experts are reordered per panel into contiguous per-domain blocks. Each domain over-activates a distinct subset of experts. Task domains: Code [1, 4, 17, 4…

**Figure 2.** MoE layer latency scales with active experts, not batch size (single MoE layer, one MI300X).
… leaves the activated-expert cost each request imposes on its decode worker unmodeled. 3 Motivation. 3.1 Active Expert Count Drives MoE Decode Latency. MoE decode latency is governed by the number of distinct experts activated at each decode step. Sparsity reduces computation but amplifies decode’s memory-bandwidth b…

**Figure 3.** Prefill expert activation predicts decode activation. Each point is one expert (normalized prefill 𝑥 vs. decode 𝑦, pooled over domains); points near the diagonal are experts used about equally in both phases.
… Russian, and French requests activate different expert regions rather than uniformly exercising the full expert pool. This shows that expert selection is correlated across related requests, not just …

**Figure 5.** WildChat [37] request volume is heavily skewed: English and Chinese alone are ∼75% of requests.
… to clustering: one in which proximity between two signatures reflects how much their requests overlap in decode-time expert activation. Only such a space lets a clustering group requests by genuine expert affinity rather than by spurious similarity. The design space is large: raw activation counts, gate logits, …

**Figure 6.** ELDR architecture: offline fitting of one centroid per decode worker over expert signatures, then online routing at the prefill-decode handoff by signature similarity, subject to load.
… therefore absent from this request’s prefill, even though it was produced (and discarded) when an earlier request first populated the cache. Without a mechanism to recover this footprint, the signature for a cache-hit reques…

**Figure 7.** Signature quality 𝜌 (Eq. 1) for six candidate transformations 𝑇. Bars are the mean across six cells (3 models × 2 workloads); whiskers span the per-cell min/max. [Plot labels: panels Qwen3-30B-A3B, GPT-OSS-120B, Gemma-4-26B-A4B; x-axis: Layers kept; y-axis: Cumulative; series: Task, Language]

**Figure 8.** Cumulative 𝜌 (Eq. 1) versus the number of layers kept under greedy layer selection. One panel per model; task (blue) and language (orange) shown separately. The star marks the peak 𝑁* chosen by ELDR’s offline fit.
The signature 𝑠_𝑟 is the building block consumed by the decode clustering and routing layers (§4.3). 4.2.3 Validation. We validate each design choice by its effect on 𝜌 (Eq. 1), measured on 1,000…

**Figure 9.** Figure 9 projects the calibration signatures onto their first two principal components and overlays the balanced 𝐾-means centroids. Task domains (Code/Math/Medical/Legal) and WildChat languages (English/Chinese/Russian/French) occupy distinct regions across all three models; the signature space has genuine semantic structure for the clustering to exploit. The centroids spread across the distinct regions rather tha…

**Figure 10.** ELDR stores expert signatures at KV cache block granularity: the signature cache is co-indexed with the KV cache.
A naive approach is to cache each request’s expert signature and reuse it when an identical prompt arrives. But a request can interact with the prefix cache in several ways: a prompt may miss the cache entirely, hit on some leading blocks and compute the rest (a partial hit), or hit on the entire…

**Figure 11.** TPOT (median, p99) and median TTFT vs. request rate on the task workload at 8P16D. [Plot labels: panels Qwen3-30B-A3B, GPT-OSS-120B, Gemma-4-26B-A4B; rows P50 TPOT (ms), P99 TPOT (ms), P50 TTFT (ms); x-axis: Request rate (req/s); series: Random, RR, JSQ, P2C, Domain, ELDR]

**Figure 13.** Mean active experts per decode step on Qwen3-30B-A3B in the task domain on an 8P16D cluster. [Plot labels: panels Qwen3-30B-A3B, GPT-OSS-120B, Gemma-4-26B-A4B; rows Task vs RR (%), Language vs RR (%); bar groups TPOT P50, TPOT P99; series: count·idf, gate-prob]

**Figure 14.** TPOT P50/P99 %Δ vs. RR at 𝑟=60 req/s (8P16D, six (model, dataset) cells) for two signature transforms: the IDF-reweighted top-𝑘 count (count·idf) vs. the continuous softmax gate (gate-prob). Rest of the recipe fixed (greedy mask, balanced 𝐾-means 𝐾=16, 𝜏=0.1).
6.3 Design Validation. Active-Expert Reduction. We validate that ELDR’s TPOT improvements come from a reduction in the number of distinct experts …

**Figure 15.** Mean %Δ vs. RR over five request rates (20–100 qps) at 8P16D with 𝜏=0.1. [Plot labels: rows Qwen task, Qwen lang, GPT-OSS task, GPT-OSS lang, Gemma task, Gemma lang; x-axis: Locality band width (𝜏) at 0.0, 0.1, 0.2, 0.3; color scale: TPOT P50 (% vs RR)]

**Figure 16.** Mean %Δ vs. RR (five rates, 20–100 qps; 8P16D) for six cells (three models × two datasets) at four 𝜏 values.
… to 6.8%: uniform decoder utilization keeps the tail in check without sacrificing the median win. Locality Band Width

Original abstract

In prefill-decode (PD) disaggregated LLM serving, each request is assigned to a decode worker after prefill. Existing decode routers balance only load; for mixture-of-experts (MoE) models this is incomplete: equally loaded workers can differ in latency, since each decode step loads the weights of every distinct expert its batch activates. We present ELDR, an expert-locality-aware decode router for PD-disaggregated MoE serving. From a request's prefill expert activations, ELDR builds an expert signature predicting the experts it will activate during generation. Offline, balanced K-means partitions signature space across decode workers; online, locality-band routing sends each request to the least-loaded worker among those best matching its signature. A signature cache, co-indexed with the KV cache at KV-block granularity, keeps signatures exact under prefix caching. Implemented in vLLM and evaluated on deployments of up to 40 GPUs, ELDR reduces median TPOT by 5.9-13.9% over the strongest of four load-balancing baselines across three MoE models and two workloads, with model outputs unchanged.
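The offline partitioning step can be approximated with a short sketch. Note the substitution: this uses a greedy, capacity-capped assignment inside a Lloyd-style loop, which is a simplification and not necessarily the balanced K-means variant the paper uses. All names, the naive initialization, and the toy 2-D "signatures" are assumptions.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sqdist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def balanced_assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Greedy assignment: closest (point, centroid) pairs first, capped per cluster."""
    n, k = len(points), len(centroids)
    capacity = -(-n // k)  # ceil(n / k)
    pairs = sorted(
        (sqdist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centroids)
    )
    assign = [-1] * n
    sizes = [0] * k
    for _, i, j in pairs:
        if assign[i] == -1 and sizes[j] < capacity:
            assign[i] = j
            sizes[j] += 1
    return assign


def balanced_kmeans(points: Sequence[Point], k: int,
                    iters: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Alternate capped assignment and centroid updates for a fixed iteration count."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assign = balanced_assign(points, centroids)
    for _ in range(iters):
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        assign = balanced_assign(points, centroids)
    return centroids, assign


# Six toy signatures forming two groups; k = 2 stands in for two decode workers.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assign = balanced_kmeans(points, k=2)
print(assign)
```

The capacity cap is what keeps any one decode worker from owning a disproportionate share of signature space, which matters when request volume is skewed across domains.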

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ELDR adds a decode router that routes on prefill expert signatures via K-means clusters and shows modest TPOT gains on real hardware, but the locality benefit hinges on an unproven correlation that the paper needs to document more clearly.


ELDR's core idea is to build an expert signature from a request's prefill activations, partition the signature space with balanced K-means across decode workers, and then route each decode request to the least-loaded worker whose cluster best matches the signature. They tie a signature cache to the KV cache at block granularity so prefix caching stays exact. The implementation sits in vLLM and they ran it on up to 40 GPUs across three MoE models and two workloads.

The practical bits are done right: the routing is online and cheap, the cache keeps signatures consistent, and outputs stay identical to the baselines. The reported 5.9-13.9% median TPOT drop over the best of four load-balancing baselines is the kind of number that matters for production decode clusters.

The soft spot is the central assumption. The method only helps if the prefill signature actually predicts which experts get activated token-by-token during autoregressive decode. The abstract claims it predicts, but without reported overlap statistics or an ablation that isolates the locality component from the load-balancer tie-breaker, it is hard to know how much of the gain is real locality versus workload-specific luck. If the correlation is only moderate, the K-means partitioning adds little over plain least-loaded routing.

This paper is for people who already run large MoE models in disaggregated setups and are looking for incremental router tweaks. The evaluation is on actual hardware at reasonable scale, the code changes are described, and the idea is self-contained, so it deserves a serious referee even if the gains stay modest after closer inspection.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ELDR, an expert-locality-aware decode router for prefill-decode disaggregated MoE LLM serving. From a request's prefill expert activations, it constructs an expert signature to predict decode-phase experts, applies offline balanced K-means partitioning of signature space across decode workers, and performs online locality-band routing to the least-loaded best-matching worker. A signature cache co-indexed with the KV cache at block granularity supports prefix caching. Implemented in vLLM, it reports 5.9-13.9% median TPOT reduction versus the strongest of four load-balancing baselines across three MoE models and two workloads on up to 40 GPUs, with unchanged model outputs.

Significance. If the prefill-to-decode prediction holds, the approach yields a practical latency improvement for MoE serving by reducing per-step expert weight loading via locality without quality loss. Credit is due for the vLLM implementation, the KV-cache-co-indexed signature mechanism, and the multi-model/multi-workload evaluation scope, which together support deployability claims.

major comments (1)

[methods (expert signature)] Expert signature construction (methods section): the claim that the prefill-derived signature 'predicts' the experts activated during autoregressive decode generation is load-bearing for the locality benefit, yet the manuscript provides no quantitative validation such as per-request overlap statistics, correlation coefficients, or ablation on prediction accuracy between prefill signatures and actual decode expert choices. Without this, the reported TPOT gains cannot be attributed to the expert-locality mechanism rather than load balancing or tie-breaking.

minor comments (2)

[abstract] Abstract and evaluation: workloads, model sizes, and exact baseline implementations should be named explicitly rather than summarized as 'two workloads' and 'four load-balancing baselines' to allow reproduction.
[system design] The signature cache description would benefit from a diagram showing its co-indexing with KV blocks at the stated granularity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment point-by-point below.


Referee: [methods (expert signature)] Expert signature construction (methods section): the claim that the prefill-derived signature 'predicts' the experts activated during autoregressive decode generation is load-bearing for the locality benefit, yet the manuscript provides no quantitative validation such as per-request overlap statistics, correlation coefficients, or ablation on prediction accuracy between prefill signatures and actual decode expert choices. Without this, the reported TPOT gains cannot be attributed to the expert-locality mechanism rather than load balancing or tie-breaking.

Authors: We agree that the manuscript lacks explicit quantitative validation of the prefill-to-decode prediction accuracy, which is needed to more rigorously attribute gains to locality rather than load balancing. In the revised manuscript we will add: (1) per-request overlap statistics (Jaccard similarity and set overlap ratios between the prefill expert signature and experts activated in the first 8–32 decode steps, reported as workload averages with standard deviation); (2) Pearson or Spearman correlation coefficients between signature similarity and observed decode expert overlap where applicable; and (3) an ablation that replaces the signature-based locality band with either pure load balancing or randomized worker assignment while keeping all other components fixed. These additions will be placed in a new subsection of the methods or evaluation and will use the same three models and two workloads already reported. The end-to-end TPOT results remain unchanged, but the new metrics will directly address the attribution concern.

Revision: yes
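The correlation check proposed in item (2) of the response could be computed along these lines. This is a dependency-free sketch of Spearman rank correlation with average ranks for ties; the input lists are made-up placeholders standing in for per-request signature similarity and observed decode overlap, not data from the paper.

```python
import math
from typing import List, Sequence


def ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder values: signature similarity vs. observed decode expert overlap.
similarity = [0.1, 0.4, 0.7, 0.9]
overlap = [0.2, 0.3, 0.8, 0.95]
print(spearman(similarity, overlap))
```

A rank correlation is a reasonable choice here because the relationship between signature similarity and overlap need not be linear for routing to benefit.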

Circularity Check

0 steps flagged

No circularity; empirical system evaluated against external baselines


The paper describes a routing heuristic that constructs an expert signature from observed prefill activations, applies offline K-means partitioning on that signature space, and routes online to the least-loaded matching worker. Performance is measured directly via TPOT reductions on real deployments against four independent load-balancing baselines. No derivation step reduces by construction to a fitted parameter renamed as prediction, no self-citation chain supports a uniqueness claim, and the predictiveness assumption is not smuggled in via definition but is instead the subject of the end-to-end empirical comparison. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; the approach rests on the domain assumption that prefill activations predict decode expert usage patterns sufficiently for clustering to improve performance over load balancing alone.

axioms (1)

domain assumption: Expert activations during prefill are predictive of those during decode generation.
This is invoked to justify building the expert signature from prefill for routing decisions.



Reference graph

Works this paper leans on


[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

[2] Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, and Tushar Krishna. 2026. Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns. arXiv:2604.23150 [cs.LG] https://arxiv.org/abs/2604.23150

[3] Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Al...

[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG] https://arxiv.org/abs/2110.14168

[6] DeepSeek-AI. 2025. EPLB: Expert Parallelism Load Balancer. https://github.com/deepseek-ai/EPLB

[7] DeepSeek-V3 Technical Report. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

[8] Seokjin Go and Divya Mahajan. 2025. MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing. arXiv:2502.06643 [cs.LG] https://arxiv.org/abs/2502.06643

[9] Google DeepMind. 2025. Gemma 4 26B-A4B. https://huggingface.co/google/gemma-4-26b-a4b

[10] Vima Gupta, Jae Hyung Ju, Kartik Sinha, Ada Gavrilovska, and Anand Padmanabha Iyer. 2026. Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection. arXiv:2411.08982 [cs.LG] https://arxiv.org/abs/2411.08982

[11] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. arXiv:2402.14008 [cs.CL] https://arxiv.org/abs/2402.14008

[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY] https://arxiv.org/abs/2009.03300

[13] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv:2103.03874 [cs.LG] https://arxiv.org/abs/2103.03874

[14] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences 11, 14 (2021). doi:10.3390/app11146421

[15] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent ... doi:10.18653/v1/d19-1259
[16] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180 [cs.LG] https://arxiv.org/abs/2309.06180

[17] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: a natural and reliable benchmark for data science code generation. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML’23). JMLR.org, Article 756, 27 pages.

[18] Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, and Pengfei Zheng. 2026. Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling. arXiv:2503.04398 [cs.LG] https://arxiv.org/abs/2503.04398

[19] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s Verify Step by Step. arXiv:2305.20050 [cs.LG] https://arxiv.org/abs/2305.20050

[20] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. arXiv:1705.04146 [cs.AI] https://arxiv.org/abs/1705.04146

[21] S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129–137. doi:10.1109/TIT.1982.1056489

[22] Mikko I. Malinen and Pasi Fränti. 2014. Balanced K-Means for Clustering. In Structural, Syntactic, and Statistical Pattern Recognition, Pasi Fränti, Gavin Brown, Marco Loog, Francisco Escolano, and Marcello Pelillo (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 32–41.

[23] M. Mitzenmacher. 2001. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems 12, 10 (2001), 1094–1104. doi:10.1109/71.963420

[24] Xuan-Phi Nguyen, Shrey Pandit, Austin Xu, Caiming Xiong, and Shafiq Joty. 2026. Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts. arXiv:2601.17111 [cs.LG] https://arxiv.org/abs/2601.17111

[25] NVIDIA. 2026. NVIDIA Dynamo: A Datacenter-Scale Distributed Inference Serving Framework. https://github.com/ai-dynamo/dynamo

[26] NVIDIA Corporation. 2024. NIXL: NVIDIA Inference Xfer Library. https://github.com/ai-dynamo/nixl

[27] Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, and Ben Athiwaratkun. 2025. Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining. arXiv:2511.02237 [cs.LG] https://arxiv.org/abs/2511.02237

[28] OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, V...

[29] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative LLM inference using phase splitting. arXiv:2311.18677 [cs.AR] https://arxiv.org/abs/2311.18677

[30] Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079 [cs.DC] https://arxiv.org/abs/2407.00079

[31] Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2024. Preble: Efficient Distributed Prompt Scheduling for LLM Serving. arXiv:2407.00023 [cs.DC] https://arxiv.org/abs/2407.00023

[32] The llm-d Authors. 2026. llm-d: Kubernetes-Native Distributed Inference. https://github.com/llm-d/llm-d

[33] Daniil Vankov, Nikita Ivkin, Kyle Ulrich, Xiang Song, Ashish Khetan, and George Karypis. 2026. XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference. arXiv:2602.07265 [cs.LG] https://arxiv.org/abs/2602.07265

[34] Wayne Winston. 1977. Optimality of the shortest line discipline. Journal of Applied Probability 14, 1 (1977), 181–189. doi:10.2307/3213271

[35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[36] Yanpeng Yu, Haiyue Ma, Krish Agarwal, Nicolai Oswald, Qijing Huang, Hugo Linsenmaier, Chunhui Mei, Ritchie Zhao, Ritika Borkar, Bita Darvish Rouhani, David Nellans, Ronny Krashinsky, and Anurag Khandelwal. 2025. Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens. arXiv:2512.09277 [cs.DC] https://arxiv.org/abs/2512.09277

[37] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat: 1M ChatGPT Interaction Logs in the Wild. arXiv:2405.01470 [cs.CL] https://arxiv.org/abs/2405.01470

[38] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] https://arxiv.org/abs/2312.07104

[39] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193–210. https://www.usenix.org/co...

[40] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Bi...


[11] [11]

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. arXiv:2402.14008 [cs.CL] https://arxiv.org/abs/2402.14008

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mas- sive Multitask Language Understanding. arXiv:2009.03300 [cs.CY] https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021

[13] [13]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Mea- suring Mathematical Problem Solving With the MATH Dataset. arXiv:2103.03874 [cs.LG]https://arxiv.org/abs/2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams.Applied Sciences11, 14 (2021). doi:10.3390/app11146421

work page doi:10.3390/app11146421 2021

[15] [15]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A Dataset for Biomedical Research Question Answering. InProceedings of the 2019 Conference on Empiri- cal Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent ...

work page doi:10.18653/v1/d19-1259 2019

[16] [16]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica

[17] [17]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180 [cs.LG]https://arxiv. org/abs/2309.06180

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: a natural and reliable benchmark for data science code generation. InProceedings of the 40th International Conference on Machine Learning(Honolulu, Hawaii, USA)(ICML’23). JMLR.org, Article 756, 27 pages

2023

[19] [19]

Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, and Pengfei Zheng. 2026. Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling. arXiv:2503.04398 [cs.LG]https://arxiv. org/abs/2503.04398

work page arXiv 2026

[20] [20]

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s Verify Step by Step. arXiv:2305.20050 [cs.LG] https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems. arXiv:1705.04146 [cs.AI] https://arxiv.org/abs/1705.04146

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [22]

S. Lloyd. 1982. Least squares quantization in PCM.IEEE Transactions on Information Theory28, 2 (1982), 129–137. doi:10.1109/TIT.1982.1056489

work page doi:10.1109/tit.1982.1056489 1982

[23] [23]

Malinen and Pasi Fränti

Mikko I. Malinen and Pasi Fränti. 2014. Balanced K-Means for Clus- tering. InStructural, Syntactic, and Statistical Pattern Recognition, Pasi Fränti, Gavin Brown, Marco Loog, Francisco Escolano, and Marcello Pelillo (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 32–41

2014

[24] [24]

Mitzenmacher

M. Mitzenmacher. 2001. The power of two choices in randomized load balancing.IEEE Transactions on Parallel and Distributed Systems12, 10 (2001), 1094–1104. doi:10.1109/71.963420

work page doi:10.1109/71.963420 2001

[25] [25]

Xuan-Phi Nguyen, Shrey Pandit, Austin Xu, Caiming Xiong, and Shafiq Joty. 2026. Least-Loaded Expert Parallelism: Load Balanc- ing An Imbalanced Mixture-of-Experts. arXiv:2601.17111 [cs.LG] https://arxiv.org/abs/2601.17111

work page arXiv 2026

[26] [26]

NVIDIA. 2026. NVIDIA Dynamo: A Datacenter-Scale Distributed In- ference Serving Framework.https://github.com/ai-dynamo/dynamo

2026

[27] [27]

NVIDIA Corporation. 2024. NIXL: NVIDIA Inference Xfer Library. https://github.com/ai-dynamo/nixl

2024

[28] [28]

Costin-Andrei Oncescu, Qingyang Wu, Wai Tong Chung, Robert Wu, Bryan Gopal, Junxiong Wang, Tri Dao, and Ben Athiwaratkun. 2025. Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining. arXiv:2511.02237 [cs.LG]https: //arxiv.org/abs/2511.02237

work page arXiv 2025

[29] [29]

OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, Boaz Barak, Ally Bennett, Tyler Bertao, Nivedita Brett, Eugene Brevdo, Greg Brockman, Sebastien Bubeck, Che Chang, Kai Chen, Mark Chen, Enoch Cheung, Aidan Clark, Dan Cook, Marat Dukhan, Casey Dvorak, Kevin Fives, V...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, 횒 챰igo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Split- wise: Efficient generative LLM inference using phase splitting. arXiv:2311.18677 [cs.AR]https://arxiv.org/abs/2311.18677

work page arXiv 2024

[31] [31]

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yong- wei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079 [cs.DC]https://arxiv.org/abs/2407.00079

work page arXiv 2025

[32] [32]

Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2024. Preble: Efficient Distributed Prompt Scheduling for LLM Serving. arXiv:2407.00023 [cs.DC]https://arxiv.org/abs/2407. 00023

work page arXiv 2024

[33] [33]

The llm-d Authors. 2026. llm-d: Kubernetes-Native Distributed Infer- ence.https://github.com/llm-d/llm-d. 14

2026

[34] [34]

Daniil Vankov, Nikita Ivkin, Kyle Ulrich, Xiang Song, Ashish Khetan, and George Karypis. 2026. XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference. arXiv:2602.07265 [cs.LG]https: //arxiv.org/abs/2602.07265

work page arXiv 2026

[35] [35]

Wayne Winston. 1977. Optimality of the shortest line discipline.Jour- nal of Applied Probability14, 1 (1977), 181??89. doi:10.2307/3213271

work page doi:10.2307/3213271 1977

[36] [36]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Yanpeng Yu, Haiyue Ma, Krish Agarwal, Nicolai Oswald, Qijing Huang, Hugo Linsenmaier, Chunhui Mei, Ritchie Zhao, Ritika Borkar, Bita Darvish Rouhani, David Nellans, Ronny Krashinsky, and Anurag Khandelwal. 2025. Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens. arXiv:2512.09277 [cs.DC]https://arxiv.org/abs/2512.09277

work page arXiv 2025

[38] [38]

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat: 1M ChatGPT Interaction Logs in the Wild. arXiv:2405.01470 [cs.CL]https://arxiv.org/abs/2405.01470

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

SGLang: Efficient Execution of Structured Language Model Programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Sto- ica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI]https://arxiv.org/abs/2312.07104

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193– 210.https://www.usenix.org/co...

2024

[41] [41]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhou- jun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Bi...

work page internal anchor Pith review Pith/arXiv arXiv 2025