Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.
Mixture of cache- conditional experts for efficient mobile device inference
2 Pith papers cite this work. Polarity classification is still indexing.
years
2025 2representative citing papers
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
citing papers explorer
-
Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
Comprehensive profiling of expert selection in frontier MoE models reveals temporal and spatial patterns that enable 6.6x speedup on wafer-scale GPUs and 1.25x on existing systems via targeted optimizations.
-
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.