Agent.xpu: Efficient scheduling of agentic LLM workloads on heterogeneous SoC
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
citing papers explorer
- Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
  NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
- GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
  The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
- Technology solutions targeting the performance of gen-AI inference in resource constrained platforms
  A roofline-based model is used to assess bandwidth and latency needs for High Bandwidth Storage in 13B-parameter models with long contexts and the utility of bonded memory chiplets for 1B-parameter models to ease capacity and bandwidth constraints in on-device gen-AI inference.