Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo · 2026 · arXiv 2506.24045

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

cs.AI · 2026-04-30 · unverdicted · novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.

Technology solutions targeting the performance of gen-AI inference in resource constrained platforms

cs.AR · 2026-04-13 · unverdicted · novelty 3.0

A roofline-based model is used to assess bandwidth and latency needs for High Bandwidth Storage in 13B-parameter models with long contexts, and to evaluate bonded memory chiplets as a way to ease capacity and bandwidth constraints for 1B-parameter models in on-device gen-AI inference.

citing papers explorer

Showing 3 of 3 citing papers.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs · cs.LG · 2026-04-20 · unverdicted · none · ref 44
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants · cs.AI · 2026-04-30 · unverdicted · none · ref 73
Technology solutions targeting the performance of gen-AI inference in resource constrained platforms · cs.AR · 2026-04-13 · unverdicted · none · ref 3
