Pith Number

pith:SGRHL77B

pith:2026:SGRHL77BNAP25FGDUODECICWAT
not attested · not anchored · not stored · refs resolved

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

Aurick Qiao, Juncheng Yang, Karthik Ganesan, Olatunji Ruwase, Samyam Rajbhandari, Yue Cheng, Yuxiong He, Zhaoyuan Su

MoE-Prefill removes redundancy overheads from MoE prefill serving by asynchronously gathering expert weights during compute-bound phases.

arxiv:2605.02960 v2 · 2026-05-03 · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{SGRHL77BNAP25FGDUODECICWAT}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

On Qwen3-235B-A22B across four hardware/precision configurations, MoE-Prefill delivers 1.35-1.37x throughput over the strongest distributed baseline on real-world workloads and up to 1.59x on long-context synthetic workloads, sustaining 29.8-36.2% per-GPU model FLOPs utilization.

C2 weakest assumption

The long, compute-bound forward passes of large-batch prefill open a per-layer window wide enough to stream expert weights in the background, replacing per-layer activation AllToAll with asynchronous weight AllGather fully overlapped with computation without new bottlenecks or accuracy loss.

C3 one line summary

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
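
The assumption in C2 is a timing argument: if each layer's compute lasts at least as long as gathering the next layer's expert weights, the gather can hide behind compute. The following is a minimal sketch of that argument as a toy per-layer timing model, not the paper's implementation. The function name `prefill_time`, the one-layer-ahead prefetch schedule, the identical-layer simplification, and the millisecond figures are all illustrative assumptions. The non-overlapped case simply serializes a communication step of the same cost with compute.

```python
def prefill_time(num_layers: int, compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Toy end-to-end time for a stack of identical MoE layers (hypothetical model).

    compute_ms: time for one layer's forward pass over the prefill batch.
    comm_ms:    time for one layer's communication step (e.g. a weight gather).
    overlap:    if True, the gather for layer i+1 runs while layer i computes.
    """
    if num_layers <= 0:
        return 0.0
    if not overlap:
        # Every layer pays communication and compute back to back.
        return num_layers * (comm_ms + compute_ms)
    # The first gather cannot be hidden. After that, each step costs whichever
    # is longer: computing layer i or gathering the weights for layer i+1.
    steady_step = max(compute_ms, comm_ms)
    return comm_ms + (num_layers - 1) * steady_step + compute_ms


# Made-up numbers, chosen only to show the two regimes.
wide_window = prefill_time(10, compute_ms=4.0, comm_ms=1.0, overlap=True)
narrow_window = prefill_time(10, compute_ms=4.0, comm_ms=6.0, overlap=True)
serialized = prefill_time(10, compute_ms=4.0, comm_ms=1.0, overlap=False)
print(wide_window, narrow_window, serialized)
```

In this model, communication becomes the bottleneck as soon as `comm_ms` exceeds `compute_ms`, which is why C2 depends on the per-layer window being wide enough.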

References

75 extracted · 75 resolved · 16 Pith anchors

[1] gpt-oss-120b & gpt-oss-20b Model Card 2025 · arXiv:2508.10925
[2] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills 2023 · arXiv:2308.16369
[3] DeepSpeed-Inference: Enabling efficient inference of transformer models at unprecedented scale 2022
[4] A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering 2025
[5] MoE-Lightning: High-throughput MoE inference on memory-constrained GPUs 2025
Receipt and verification
First computed 2026-05-20T00:00:40.477498Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

91a275ffe1681fae94c3a38641205604f0c2f8207ac4823a618be80750689201

Aliases

arxiv: 2605.02960 · arxiv_version: 2605.02960v2 · doi: 10.48550/arxiv.2605.02960 · pith_short_12: SGRHL77BNAP2 · pith_short_16: SGRHL77BNAP25FGD · pith_short_8: SGRHL77B
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/SGRHL77BNAP25FGDUODECICWAT \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 91a275ffe1681fae94c3a38641205604f0c2f8207ac4823a618be80750689201
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "bf1f9ca812600cc95e0be27a7a4414028f4b550c704c5c74d72538c538b50d81",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-03T03:10:24Z",
    "title_canon_sha256": "8f2d740a74c55c27cdcf58160f8ba9005e393ea49b602d7e6f62e60564e6bf54"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.02960",
    "kind": "arxiv",
    "version": 2
  }
}
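
The same check can be run offline against the record shown above. This is a minimal sketch that mirrors the canonicalization used in the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes, SHA-256). The helper name `canonical_sha256` is hypothetical, and whether the digest matches depends on the record being copied exactly as served.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same JSON canonicalization as the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode()                   # str.encode() defaults to UTF-8
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "bf1f9ca812600cc95e0be27a7a4414028f4b550c704c5c74d72538c538b50d81",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-03T03:10:24Z",
        "title_canon_sha256": "8f2d740a74c55c27cdcf58160f8ba9005e393ea49b602d7e6f62e60564e6bf54",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.02960", "kind": "arxiv", "version": 2},
}

# Hash published on this page under "Canonical hash".
EXPECTED = "91a275ffe1681fae94c3a38641205604f0c2f8207ac4823a618be80750689201"

digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, a mirror that stores the record with a different key order should still compute the same digest.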