From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill

Gunjun Lee; Jaiyoung Park; Jiwon Kim; Jung Ho Ahn; Younjoo Lee

arxiv: 2510.08055 · v2 · submitted 2025-10-09 · 💻 cs.LG · cs.DC

Pith reviewed 2026-05-18 08:44 UTC · model grok-4.3

keywords: Mixture-of-Experts · LLM inference serving · prefill scheduling · stall-free decoding · layer partitioning · energy efficiency · MoE weight loading

The pith

Layered prefill partitions MoE models into contiguous layer groups to interleave prefill and decode, sustaining stall-free operation while cutting TTFT by up to 70% and energy by 22%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces layered prefill for Mixture-of-Experts LLM serving to address limitations of chunked prefill. Chunked prefill splits long prompts by tokens and interleaves with decode but causes redundant expert weight loads in MoE models, increasing memory traffic by up to 39%. Layered prefill instead partitions the model vertically into contiguous layer groups and interleaves prefill and decode across these groups. This maintains stable time-between-token while reducing off-chip bandwidth demand. Evaluations indicate it improves the TTFT-TBT trade-off and lowers per-token energy consumption.
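
As a rough illustration of the redundant-load argument, the sketch below counts how many times a layer's expert weights are fetched on behalf of one prompt's prefill under each scheme. It is a toy model, not the paper's measurement methodology: it assumes every prefill pass over a layer fetches that layer's expert weights once, and it ignores decode traffic and caching. The function names, the 512-token chunk size, and the 48-layer depth are hypothetical; only the 8,192-token prompt length is taken from the Figure 2 caption.

```python
import math


def chunked_prefill_layer_loads(prompt_tokens: int, chunk_size: int, num_layers: int) -> int:
    """Expert-weight fetches for one prompt, in units of "one layer's experts".

    Toy assumption: each chunk is a separate pass over all layers, and each
    pass fetches every layer's expert weights again.
    """
    if prompt_tokens <= 0 or chunk_size <= 0 or num_layers <= 0:
        raise ValueError("all arguments must be positive")
    num_chunks = math.ceil(prompt_tokens / chunk_size)
    return num_chunks * num_layers


def layered_prefill_layer_loads(num_layers: int) -> int:
    """Toy assumption: the whole prompt crosses each layer exactly once."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    return num_layers


if __name__ == "__main__":
    # Hypothetical configuration: 512-token chunks, 48 layers.
    chunked = chunked_prefill_layer_loads(prompt_tokens=8192, chunk_size=512, num_layers=48)
    layered = layered_prefill_layer_loads(num_layers=48)
    print(f"chunked: {chunked} layer-loads, layered: {layered} layer-loads")
```

Under these assumptions the count grows with the number of chunks for chunked prefill and does not depend on prompt length for layered prefill. The resulting ratio is not comparable to the paper's 39% figure, which the abstract reports as an increase in memory traffic.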

Core claim

By shifting the scheduling axis from tokens to layers, layered prefill treats contiguous layer groups as atomic scheduling units. It interleaves prefill and decode across these groups to achieve stall-free decoding without the chunk-induced MoE weight reloads, thereby lowering TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by up to 22%.

What carries the argument

The layered prefill scheduler, which vertically partitions the transformer into contiguous layer groups and interleaves prefill and decode operations across these groups instead of token chunks.
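
A minimal sketch of that interleaving pattern follows, based only on the Figure 1 description: in each iteration exactly one layer group runs prefill together with decode, the other groups run decode only, and prefill advances by one group per iteration. The helper names `split_layers` and `layered_prefill_schedule` and the near-equal group sizing are assumptions made for illustration; the paper's actual grouping policy is not described on this page.

```python
from typing import List


def split_layers(num_layers: int, num_groups: int) -> List[range]:
    """Partition layer indices 0..num_layers-1 into contiguous, near-equal groups."""
    if not 1 <= num_groups <= num_layers:
        raise ValueError("need 1 <= num_groups <= num_layers")
    base, extra = divmod(num_layers, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)  # first `extra` groups get one more layer
        groups.append(range(start, start + size))
        start += size
    return groups


def layered_prefill_schedule(num_groups: int) -> List[List[str]]:
    """Work assigned to each group in each iteration while one prompt is prefilled.

    Row i is iteration i; column g is layer group g. Prefill of the prompt
    completes after num_groups iterations.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    return [
        ["prefill+decode" if g == it else "decode" for g in range(num_groups)]
        for it in range(num_groups)
    ]


if __name__ == "__main__":
    groups = split_layers(num_layers=8, num_groups=4)
    for it, row in enumerate(layered_prefill_schedule(len(groups))):
        plan = ", ".join(f"layers {g.start}-{g.stop - 1}: {w}" for g, w in zip(groups, row))
        print(f"iteration {it}: {plan}")
```

Each iteration still runs decode through every group, which is what keeps decoding stall-free in this picture; only the position of the prefill work moves from one group to the next.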

If this is right

Reduces off-chip bandwidth demand by eliminating redundant expert weight loads.
Lowers TTFT by up to 70% while preserving stall-free decoding.
Decreases end-to-end latency by 41% and per-token energy by up to 22%.
Consistently improves the TTFT-TBT Pareto frontier over chunked prefill.
Lowers expert-load traffic and energy cost in co-located prefill-decode environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar layer-group scheduling might benefit dense transformer models by reducing activation movement even without experts.
The approach could extend to distributed serving setups where layer groups align with device boundaries.
Future work might explore dynamic group sizing based on prompt length or model depth to optimize further.
Combining layered prefill with other memory optimizations could yield additional gains in energy efficiency.

Load-bearing premise

Contiguous layer groups can be scheduled as atomic units without violating sequential data dependencies between layers or introducing synchronization overhead that reintroduces stalls.

What would settle it

An experiment measuring whether synchronization costs between layer groups exceed the bandwidth savings from avoided expert reloads, or whether TBT exceeds targets for models with high interconnect latency.
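
The comparison such an experiment would make can be written as a simple break-even check. The sketch below is an assumed back-of-the-envelope model, not something taken from the paper: it converts the avoided expert-weight bytes into time at a given off-chip bandwidth and subtracts a fixed cost per group-boundary crossing. The function name, every parameter, and the example values are hypothetical.

```python
def net_time_saved(
    saved_bytes: float,
    bandwidth_bytes_per_s: float,
    boundary_crossings: int,
    sync_seconds_per_crossing: float,
) -> float:
    """Seconds saved by avoided weight loads minus seconds spent synchronizing.

    A positive result means the bandwidth savings outweigh the boundary cost
    under this simplified model; a negative result means they do not.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    if boundary_crossings < 0 or sync_seconds_per_crossing < 0:
        raise ValueError("synchronization terms must be non-negative")
    load_time_saved = saved_bytes / bandwidth_bytes_per_s
    sync_time_spent = boundary_crossings * sync_seconds_per_crossing
    return load_time_saved - sync_time_spent


if __name__ == "__main__":
    # Hypothetical values: 2 GB of avoided loads, 1 TB/s bandwidth,
    # 4 boundary crossings at 100 microseconds each.
    print(net_time_saved(2e9, 1e12, 4, 1e-4))
```

Real measurements would replace both terms with observed values, since boundary costs may overlap with compute rather than add linearly.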

Figures

Figures reproduced from arXiv: 2510.08055 by Gunjun Lee, Jaiyoung Park, Jiwon Kim, Jung Ho Ahn, Younjoo Lee.

**Figure 1.** (Upper right) Per iteration, chunked prefill splits an input prompt into multiple chunks, and at each iteration one chunk is processed in order from the beginning with the decode. (Lower right) For layered prefill, exactly one layer group performs both prefill and decode, while the others perform decode only. Prefill advances by one group per iteration, maintaining stall-free decoding.

**Figure 2.** (Left) MoE weight loading vs. chunk size. The hatched region indicates the MoE weights loaded by a single chunk. (Right) Runtime of each kernel vs. chunk size. The input length is fixed at 8,192 tokens.

**Figure 3.** SLO attainment under different request rates. The red horizontal line marks the effective SLO attainment threshold (90%). … layered prefill across request rates for two models (Qwen and GPT) and two workloads (arXiv and ShareGPT). Qwen: (a) On arXiv, layered prefill sustains ≈100% SLO attainment through 1.7 req/s, while chunked prefill collapses by 1.5; at 1.8 req/s layered prefill remains well above chunke…

**Figure 5.** Token generation over time on arXiv with Qwen. … compare cumulative token output for a single request on Qwen using the arXiv workload at a request rate of 1.3 req/s under chunked prefill and layered prefill. The steeply rising middle interval reflects the period when layered prefill has quickly finished other requests' prefills and runs in decode-only mode, so token generation accelerates. These factors reduc…

Original abstract

Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits the processing of long prompts along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to 39% and inflate energy consumption. We propose layered prefill, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit, specifically targeting MoE serving. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by up to 22%. Evaluations show that layered prefill consistently improves the TTFT-TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware MoE serving in co-located environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Layered prefill changes the scheduling unit to layer groups in MoE models to cut expert reloads while keeping decode stall-free, with reported gains that need more experimental backing on boundary costs.

The one thing to know is that this paper proposes layered prefill, which aims to improve MoE serving by scheduling on layer groups rather than token chunks. It claims large wins on TTFT and energy by cutting expert reloads while keeping decode stall-free.

What is new is the vertical partitioning approach. Instead of splitting along the sequence dimension like chunked prefill, it divides the model into contiguous layer groups and interleaves prefill and decode across them. This targets the redundant weight loads that occur in MoE models when experts are reloaded for each chunk. The paper shows this reduces off-chip bandwidth demand and improves the latency-energy trade-offs in co-located environments.

The work does well at grounding the idea in a concrete production constraint: meeting both TTFT and TBT targets under fixed hardware. The reported numbers (up to 70% lower TTFT, 41% lower end-to-end latency, up to 22% lower per-token energy) come from empirical measurements on the new scheduling axis. That shift from tokens to layers is more than a minor tweak on existing chunked prefill methods.

Soft spots include the lack of detail on how layer groups are chosen and sized. There are no error bars or sensitivity analysis behind the high-level claims. The assumption that contiguous groups can be scheduled atomically without dependency violations or added synchronization overhead at boundaries is central but not stress-tested against different model depths or hardware latencies. If those overheads are present, they could eat into the gains.

This paper is for people building or optimizing large MoE inference systems. Readers who care about practical serving efficiency and memory traffic in production clusters will find it relevant. It deserves a serious referee because the idea addresses a genuine gap in current stall-free techniques for MoE. I recommend engaging with it in review, with attention to the experimental validation of the boundary costs.

Referee Report

1 major / 2 minor

Summary. The paper proposes layered prefill, a scheduling paradigm for MoE LLM serving that vertically partitions the model into contiguous layer groups and interleaves prefill and decode across groups. This replaces token-dimension chunking to eliminate redundant expert weight reloads while preserving stall-free TBT, with reported gains of up to 70% lower TTFT, 41% lower end-to-end latency, and 22% lower per-token energy.

Significance. If validated, the shift from token-based to layer-based scheduling granularity could meaningfully improve memory bandwidth efficiency and energy use in co-located MoE inference without sacrificing latency SLOs. The empirical Pareto-frontier improvements over chunked prefill represent a concrete contribution to serving-system design for large expert models.

major comments (1)

The central claim depends on treating contiguous layer groups as atomic scheduling units for interleaving prefill and decode. Sequential hidden-state dependencies between layers imply that group boundaries may require buffering, barriers, or pipeline flushes; the manuscript provides no analysis of the resulting synchronization or memory-traffic overhead as a function of group size, model depth, or interconnect latency. This premise is load-bearing for the reported 70% TTFT and 41% latency reductions.

minor comments (2)

Experimental results report concrete percentage gains but omit error bars, the exact layer-group sizes used, and a breakdown of how much improvement derives from reduced expert reloads versus other factors.
Notation for group boundaries and the precise interleaving schedule should be formalized (e.g., with a diagram or pseudocode) to clarify dataflow across groups.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of shifting scheduling granularity from tokens to layers in MoE serving. We address the major comment point by point below. We will incorporate additional analysis in the revised manuscript to strengthen the treatment of synchronization and overhead at layer-group boundaries.

Referee: The central claim depends on treating contiguous layer groups as atomic scheduling units for interleaving prefill and decode. Sequential hidden-state dependencies between layers imply that group boundaries may require buffering, barriers, or pipeline flushes; the manuscript provides no analysis of the resulting synchronization or memory-traffic overhead as a function of group size, model depth, or interconnect latency. This premise is load-bearing for the reported 70% TTFT and 41% latency reductions.

Authors: We thank the referee for this important observation. Within each contiguous layer group, layers execute sequentially with direct hidden-state handoff exactly as in a standard forward pass; no extra buffering is introduced inside the group. Interleaving between prefill and decode occurs only at group boundaries, which serve as natural synchronization points where the scheduler can context-switch requests. This design mirrors the boundary handling already present in pipeline-parallel inference but applied to prefill-decode co-location. We acknowledge that the submitted manuscript did not contain a dedicated quantitative analysis of synchronization or memory-traffic overhead as a function of group size, depth, or interconnect latency. In the revision we will add a new subsection that (1) provides an analytical model of the additional traffic at group boundaries, (2) reports micro-benchmark measurements for group sizes of 2, 4, and 8 layers on the evaluated models, and (3) shows that the incremental overhead remains below 5% of total memory traffic and is more than offset by the elimination of redundant expert weight loads. These additions will directly support the reported TTFT and latency gains.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical scheduling improvements rest on measurements, not self-referential derivation

The paper introduces layered prefill as a scheduling paradigm that partitions the model into contiguous layer groups and interleaves prefill/decode to avoid MoE weight reloads while preserving stall-free operation. All reported gains (up to 70% TTFT reduction, 41% end-to-end latency, 22% energy) are presented as results of evaluations and direct measurements on the proposed system. No equations, fitted parameters, or first-principles derivations appear in the abstract or description that would allow a quantity to be redefined in terms of itself. The design choice of treating layer groups as atomic units is an engineering assumption whose validity is checked empirically rather than derived by construction from prior results or self-citations. The derivation chain is therefore self-contained and independent of the target performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about transformer layer dependencies and MoE expert activation patterns; no new physical constants or ad-hoc fitted scalars are introduced in the abstract.

axioms (2)

domain assumption: Transformer layers must execute sequentially within a forward pass; data dependencies between consecutive layers cannot be violated by the scheduler.
Implicit in any layer-group interleaving scheme; if false, the proposed vertical partitioning would produce incorrect outputs.
domain assumption: Expert weights for a given layer group remain resident in on-chip memory for the duration of that group's prefill or decode work.
Required for the elimination of redundant off-chip loads; stated as the mechanism that removes chunk-induced reloads.
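
The first axiom can be turned into a small structural check on a proposed grouping. The sketch below is an illustrative validator under stated assumptions, not code from the paper: it treats a grouping as valid only if the groups are non-empty, contiguous, listed in layer order, and together cover every layer exactly once, which is the minimum needed for hidden states to flow from one group to the next in forward-pass order. The function name `is_valid_partition` and the use of Python `range` objects to represent groups are hypothetical choices.

```python
from typing import Sequence


def is_valid_partition(groups: Sequence[range], num_layers: int) -> bool:
    """Return True if `groups` splits layers 0..num_layers-1 into ordered,
    contiguous, non-overlapping, non-empty blocks."""
    expected_start = 0
    for group in groups:
        # Each group must be a dense run of layers starting where the previous one ended.
        if group.step != 1 or len(group) == 0 or group.start != expected_start:
            return False
        expected_start = group.stop
    # The last group must end exactly at the final layer.
    return expected_start == num_layers


if __name__ == "__main__":
    print(is_valid_partition([range(0, 2), range(2, 4)], num_layers=4))  # contiguous and ordered
    print(is_valid_partition([range(2, 4), range(0, 2)], num_layers=4))  # out of order
```

A check like this covers only the ordering part of the premise; it says nothing about synchronization cost at the boundaries or about whether expert weights stay resident on chip.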

