Title resolution pending

URL https://arxiv · 2025 · arXiv 2507.15465

4 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving

cs.NI · 2026-04-30 · unverdicted · novelty 6.0

Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.

Janus: Disaggregating Attention and Experts for Scalable MoE Inference

cs.DC · 2025-12-15 · unverdicted · novelty 6.0

JANUS disaggregates attention and MoE layers onto separate GPU pools with an expert-balancing scheduler and SLO-aware scaling, delivering up to 4.7x higher per-GPU throughput than prior MoE systems under token-level latency constraints.

From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill

cs.LG · 2025-10-09 · unverdicted · novelty 6.0

Layered prefill replaces token-chunked prefill with layer-group interleaving in MoE models, cutting TTFT by up to 70%, end-to-end latency by 41%, and per-token energy by 22% while preserving stall-free TBT.

Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO

cs.DC · 2026-04-13 · unverdicted · novelty 4.0

StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.

citing papers explorer

Showing 4 of 4 citing papers.

Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving · cs.NI · 2026-04-30 · unverdicted · none · ref 69
Janus: Disaggregating Attention and Experts for Scalable MoE Inference · cs.DC · 2025-12-15 · unverdicted · none · ref 35
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill · cs.LG · 2025-10-09 · unverdicted · none · ref 19
Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO · cs.DC · 2026-04-13 · unverdicted · none · ref 2
