GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-Scale DNN Training with Virtual Memory Stitching
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields: cs.AR (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: background (2)
polarities: background (2)
representative citing papers
TokenStack's heterogeneous HBM-PIM design with base-die control and topology-aware KV placement delivers 1.62x higher geometric-mean token throughput and 1.70x SLO-compliant serving capacity than AttAcc while cutting per-token energy by 30-47%.
citing papers explorer
- MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
  MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
- TokenStack: A Heterogeneous HBM-PIM Architecture and Runtime for Efficient LLM Inference
  TokenStack's heterogeneous HBM-PIM design with base-die control and topology-aware KV placement delivers 1.62x higher geometric-mean token throughput and 1.70x SLO-compliant serving capacity than AttAcc while cutting per-token energy by 30-47%.