FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

Jiongjiong Gu, Jianfeng Wang, Zidong Han, Yongqiao Wang, Pengfei Xia, Mingjie Zhang, Hong Liu, Yuanyi Xia, Jiajia Chu, Yifeng Tang, Hui Zang, Xin Yao, Qijie Qiu, Yuzhao Wang, Chuanfei Xu, Lin Zhang, Zhuonan Lai, Hongming Huang, Jiawei Qiu, Gong Zhang, Zhong Ming, Weipeng Cao

arxiv: 2606.04415 · v1 · pith:STQ5BCYKnew · submitted 2026-06-03 · 💻 cs.DC

Pith reviewed 2026-06-28 04:55 UTC · model grok-4.3

classification 💻 cs.DC

keywords NPU virtualization · LLM serving · prefill-decode co-location · Ascend NPU · transparent virtualization · dynamic scheduling · AI inference · phase-aware scheduling

The pith

FlexNPU interposes on AscendCL APIs to enable dynamic prefill-decode co-location on NPUs without modifying applications or drivers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FlexNPU as a user-space virtualization layer that routes NPU operations through per-device daemons. This boundary decouples applications from physical devices and permits phase-aware scheduling that shifts resources between the compute-intensive prefill stage and the memory-bound decode stage. A sympathetic reader would care because static disaggregation can leave resources imbalanced, while direct device exposure prevents runtime adaptation. If the approach holds, clusters can use the same hardware for both phases with lower first-token latency and without the extra data movement that disaggregation can introduce. The work reports these gains on Huawei Ascend NPUs, including a 384-card Ascend 910C deployment of DeepSeek-R1, and on Qwen2.5-7B.

Core claim

FlexNPU is a transparent virtualization layer that interposes on AscendCL APIs and routes operations through per-device daemons, decoupling unmodified applications from physical NPU devices. The resulting runtime boundary supports virtualized objects, controlled operator dispatch, and dynamic PD co-location that adapts scheduling to the complementary characteristics of the prefill and decode phases. On a 384-card Ascend 910C deployment of DeepSeek-R1, the system raises throughput by 5.15 percent and 26.33 percent over static PD disaggregation; on Qwen2.5-7B it keeps throughput comparable to static PD co-location while cutting time to first token (TTFT) by more than 92 percent with nearly unchanged time per output token (TPOT).
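The two latency metrics in this claim, TTFT (time to first token) and TPOT (time per output token), can be pinned down with a small sketch. The definitions below are the commonly used ones; the paper's exact measurement method is not given in the abstract, and the function names and millisecond timestamps are made up for illustration.

```python
def ttft_ms(arrival_ms, token_times_ms):
    """Time to first token: delay from request arrival to the first output token."""
    return token_times_ms[0] - arrival_ms


def tpot_ms(token_times_ms):
    """Time per output token: mean gap between consecutive output tokens."""
    if len(token_times_ms) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    return (token_times_ms[-1] - token_times_ms[0]) / (len(token_times_ms) - 1)


def pct_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Illustrative numbers only, not taken from the paper.
tokens = [500, 600, 700, 800]          # emission times of four output tokens
first_token_delay = ttft_ms(0, tokens)
per_token_gap = tpot_ms(tokens)
reduction = pct_reduction(500, 40)     # a TTFT drop from 500 ms to 40 ms
```

With these invented timestamps a TTFT drop from 500 ms to 40 ms is a 92 percent reduction, which gives a sense of the scale of change the claim describes.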

What carries the argument

Transparent interposition on AscendCL APIs routed through per-device daemons, which virtualizes NPU objects and controls operator dispatch for phase-aware scheduling.
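As a rough illustration of this mechanism, the sketch below models interposition in Python rather than at the AscendCL layer. Everything in it is hypothetical: `DeviceDaemon`, `VirtualNPU`, `malloc`, and `launch` are invented names, not AscendCL functions or FlexNPU internals. It only shows the shape of the idea: the application holds virtual handles, while a per-device daemon owns the physical objects and sees every operator dispatch.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class DeviceDaemon:
    """Hypothetical per-device daemon: owns physical objects and a dispatch log."""
    device_id: int
    _next_phys: count = field(default_factory=lambda: count(0x1000, 0x100))
    dispatched: list = field(default_factory=list)

    def alloc(self, nbytes: int) -> int:
        # Stand-in for a real device allocation; returns a fake physical address.
        return next(self._next_phys)

    def run(self, op: str, phys_args: tuple) -> None:
        # A real daemon could reorder or delay work here; this one only records it.
        self.dispatched.append((op, phys_args))


class VirtualNPU:
    """Hypothetical client-side shim: the application only ever sees virtual handles."""

    def __init__(self, daemon: DeviceDaemon):
        self._daemon = daemon
        self._table = {}              # virtual handle -> physical address
        self._next_virt = count(1)

    def malloc(self, nbytes: int) -> int:
        virt = next(self._next_virt)
        self._table[virt] = self._daemon.alloc(nbytes)
        return virt

    def launch(self, op: str, *virt_handles: int) -> None:
        # Translate virtual handles to physical ones before the daemon runs the op.
        phys = tuple(self._table[v] for v in virt_handles)
        self._daemon.run(op, phys)


daemon = DeviceDaemon(device_id=0)
npu = VirtualNPU(daemon)
a, b = npu.malloc(1024), npu.malloc(1024)
npu.launch("matmul", a, b)
```

Because the handle table sits between the application and the device, a scheduler behind the daemon can change what a handle maps to or when an operator runs without the application noticing, which is the property the review treats as load-bearing.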

If this is right

Dynamic PD co-location adapts scheduling to the complementary resource demands of prefill and decode without extra data movement.
The same hardware pool can serve both phases while preserving unmodified model code, frameworks, and NPU drivers.
Throughput rises over static disaggregation while TTFT drops sharply with little change to TPOT.
No measurable overhead appears relative to direct device passthrough in the tested workloads.
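To make the idea of phase-aware co-location concrete, here is a toy scheduling step. It is not the policy described in the paper, which the abstract does not specify. It assumes a per-step compute budget counted in tokens, admits every pending decode first, and fills the remaining budget with a chunk of the oldest prefill. The name `schedule_step` and the request dictionaries are invented for this sketch.

```python
from collections import deque


def schedule_step(prefill_q, decode_q, compute_budget):
    """Build one step's batch: all decodes first, then prefill chunks that fit."""
    batch = []
    # Each decode request produces one token per step and costs one budget unit.
    for req in decode_q:
        batch.append(("decode", req["id"], 1))
        compute_budget -= 1
    # Spend whatever budget is left on the prefill at the head of the queue.
    while prefill_q and compute_budget > 0:
        req = prefill_q[0]
        chunk = min(req["remaining"], compute_budget)
        batch.append(("prefill", req["id"], chunk))
        req["remaining"] -= chunk
        compute_budget -= chunk
        if req["remaining"] == 0:
            # Prefill finished: the request moves to the decode phase.
            prefill_q.popleft()
            decode_q.append(req)
    return batch


prefill_q = deque([{"id": "p1", "remaining": 6}])
decode_q = deque([{"id": "d1"}, {"id": "d2"}])
step1 = schedule_step(prefill_q, decode_q, compute_budget=5)
step2 = schedule_step(prefill_q, decode_q, compute_budget=5)
```

In this toy run the six-token prefill is split across two steps while both decodes advance every step, and the finished prefill then joins the decode queue on the same device.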

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same virtualization boundary could support other phase-aware policies such as mixed batching or priority preemption if the daemon layer is extended.
Clusters that currently maintain separate prefill and decode hardware pools might consolidate onto fewer devices if dynamic co-location scales reliably.
The approach would become more attractive for production if the daemon routing could be made portable across additional NPU vendors without new interposition code.

Load-bearing premise

Interposing on AscendCL APIs and routing through per-device daemons can be performed transparently with no measurable overhead or compatibility breakage for unmodified applications and drivers.

What would settle it

A side-by-side measurement on the same 384-card Ascend 910C cluster that shows either measurable inference slowdown or application breakage under FlexNPU compared with direct passthrough would falsify the transparency claim.
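One way such a side-by-side comparison could be analyzed is sketched below: a bootstrap interval for the ratio of median latencies under FlexNPU and under direct passthrough. The latency values are invented, the function name `overhead_ci` is hypothetical, and the choice of a median ratio with a 95 percent interval is an assumption of this sketch rather than anything the paper reports.

```python
import random
import statistics


def overhead_ci(passthrough, virtualized, n_boot=2000, seed=0):
    """Bootstrap a 95% interval for median(virtualized) / median(passthrough)."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    ratios = []
    for _ in range(n_boot):
        p = rng.choices(passthrough, k=len(passthrough))
        v = rng.choices(virtualized, k=len(virtualized))
        ratios.append(statistics.median(v) / statistics.median(p))
    ratios.sort()
    lo = ratios[int(0.025 * n_boot)]
    hi = ratios[int(0.975 * n_boot) - 1]
    return lo, hi


# Made-up per-request latencies in milliseconds, for illustration only.
passthrough = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
virtualized = [10.0, 10.1, 9.9, 10.2, 9.8, 10.1, 10.0, 9.9]
lo, hi = overhead_ci(passthrough, virtualized)
# An interval lying entirely above 1.0 would indicate a measurable slowdown.
slowdown_detected = lo > 1.0
```

An interval that contains 1.0 is consistent with the no-overhead claim but does not prove it; the width of the interval shows how small a slowdown the experiment could have detected.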

Original abstract

Modern AI serving increasingly relies on NPUs for conventional inference and large language model serving. However, current NPU deployments commonly expose physical devices directly to applications, which limits runtime control over scheduling and makes it difficult to adapt execution to phase-level workload behavior. This limitation is particularly evident in LLM serving, where the prefill phase is compute-intensive while the decode phase is often constrained by memory bandwidth and KV-cache accesses. Static prefill-decode (PD) disaggregation reduces phase interference, but can introduce resource imbalance and unnecessary data movement. We present FlexNPU, a transparent user-space virtualization layer for Ascend NPUs. FlexNPU interposes on AscendCL APIs and routes NPU operations through per-device daemons, decoupling unmodified applications from physical NPU devices without modifying model code, AI frameworks, or NPU drivers. This runtime boundary allows FlexNPU to virtualize NPU objects, control operator dispatch, and support phase-aware scheduling for LLM serving. In particular, FlexNPU enables dynamic PD co-location, which adapts scheduling between prefill and decode according to their complementary resource characteristics. We implement FlexNPU on Huawei Ascend NPUs and evaluate it with typical LLM workloads. Compared with direct NPU passthrough, FlexNPU introduces no measurable inference overhead and slightly improves throughput in some scenarios. On a 384-card Ascend 910C deployment of DeepSeek-R1, FlexNPU improves throughput over static PD disaggregation by 5.15% and 26.33%. On Qwen2.5-7B, compared with static PD co-location, FlexNPU maintains comparable throughput while reducing TTFT by over 92% across tested workloads with nearly unchanged TPOT. These results show that transparent NPU virtualization is a practical substrate for efficient and responsive LLM serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FlexNPU adds a user-space daemon layer for Ascend NPUs that enables dynamic prefill-decode co-location without code changes, but the abstract supplies no experimental details to back the zero-overhead or throughput claims.

FlexNPU describes a user-space virtualization layer for Ascend NPUs that routes AscendCL calls through per-device daemons. This lets unmodified applications run with phase-aware scheduling for LLM prefill and decode on the same devices.

The new element is the specific combination of transparent interposition, object virtualization, and dynamic co-location support on Ascend hardware. The paper correctly notes that static disaggregation can waste resources while pure co-location creates interference, and it positions the daemon boundary as a way to adapt dispatch at runtime.

It reports no measurable overhead versus direct passthrough, plus throughput gains of 5.15% and 26.33% over static disaggregation on a 384-card DeepSeek-R1 run and a TTFT reduction of over 92% on Qwen2.5-7B with stable TPOT. Those numbers, if they hold, would matter for large Ascend deployments.

The soft spot is the complete absence of experimental information. The abstract gives no workload details, baseline implementations, measurement methods, or microbenchmarks isolating daemon latency. The central assumption, that interposition adds nothing measurable, remains untested in the provided text. If the routing adds even a modest per-call cost or changes dispatch behavior, both the no-overhead claim and the dynamic co-location benefit become unreliable.

This is for systems people running LLM serving on Ascend NPUs who need flexible resource control. A reader working on accelerator virtualization would find the design points worth examining.

It deserves peer review. The scale and the practical goal are worth referee time, but the paper needs full methods and validation sections before the results can be assessed.

Referee Report

2 major / 1 minor

Summary. The paper introduces FlexNPU, a transparent user-space virtualization layer for Ascend NPUs that interposes on AscendCL APIs and routes NPU operations through per-device daemons. This decouples unmodified applications from physical devices without changes to model code, AI frameworks, or drivers, enabling phase-aware scheduling for dynamic prefill-decode co-location in LLM serving. The work claims zero measurable overhead versus direct passthrough and reports specific gains: 5.15% and 26.33% throughput improvement over static PD disaggregation on a 384-card Ascend 910C deployment of DeepSeek-R1, plus >92% TTFT reduction with comparable throughput on Qwen2.5-7B versus static co-location.

Significance. If the interposition layer truly incurs no measurable overhead and maintains full compatibility, the result demonstrates a practical runtime substrate for adaptive NPU scheduling that addresses phase interference and resource imbalance in LLM inference. The evaluation on production-scale models and a large 384-card cluster provides concrete evidence of deployability on real Ascend hardware.

major comments (2)

[Abstract and Evaluation section] The claim that FlexNPU 'introduces no measurable inference overhead' is load-bearing for all reported gains, yet no microbenchmarks isolating interposition or daemon-routing latency are described, nor is there an enumeration of intercepted AscendCL calls or a compatibility matrix with unmodified drivers and frameworks.
[Abstract] The headline results (5.15%/26.33% throughput lift on DeepSeek-R1; >92% TTFT reduction on Qwen2.5-7B) are presented without any description of experimental methodology, baseline details, workload characteristics, run counts, or error bars, preventing verification that the data support the claims.
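The kind of microbenchmark the first comment asks for can be sketched in miniature. The harness below times a direct call against the same call reached through one extra hop. It uses plain Python functions as stand-ins, so the numbers say nothing about AscendCL or FlexNPU; `direct_call`, `interposed_call`, and `median_ns` are hypothetical names, and a real benchmark would time the actual interposed API path against passthrough.

```python
import statistics
import time


def direct_call(x):
    # Stand-in for a call that goes straight to the device runtime.
    return x + 1


def interposed_call(x, _target=direct_call):
    # Stand-in for an interposed path: one extra hop before the real call.
    return _target(x)


def median_ns(fn, iters=10_000, repeats=5):
    """Median per-call time in nanoseconds across several repeats."""
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        for i in range(iters):
            fn(i)
        per_call.append((time.perf_counter_ns() - start) / iters)
    return statistics.median(per_call)


direct = median_ns(direct_call)
interposed = median_ns(interposed_call)
added_ns = interposed - direct   # per-call cost attributable to the extra hop
```

Reporting `added_ns` per intercepted call, alongside how often each call occurs in a serving step, would let a reader judge whether the routing cost is negligible relative to operator run time.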

minor comments (1)

[Abstract] The abstract would benefit from a brief sentence clarifying the scope of 'unmodified applications' (e.g., whether it includes all common inference frameworks).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight areas where additional detail will strengthen the manuscript. We address each point below and will incorporate the requested changes in the revised version.

Referee: [Abstract and Evaluation section] The claim that FlexNPU 'introduces no measurable inference overhead' is load-bearing for all reported gains, yet no microbenchmarks isolating interposition or daemon-routing latency are described, nor is there an enumeration of intercepted AscendCL calls or a compatibility matrix with unmodified drivers and frameworks.

Authors: We agree that the current version does not include dedicated microbenchmarks isolating interposition and daemon-routing latency, nor an explicit enumeration of intercepted AscendCL calls or a compatibility matrix. In the revision we will add a new subsection to the Evaluation section containing (1) microbenchmarks for the latency of the interposed AscendCL path versus direct passthrough, (2) a table listing all intercepted calls, and (3) a compatibility matrix covering the tested frameworks and drivers. These additions will directly support the zero-overhead claim with quantitative evidence.

revision: yes

Referee: [Abstract] The headline results (5.15%/26.33% throughput lift on DeepSeek-R1; >92% TTFT reduction on Qwen2.5-7B) are presented without any description of experimental methodology, baseline details, workload characteristics, run counts, or error bars, preventing verification that the data support the claims.

Authors: The abstract is intentionally concise, but we acknowledge that the headline numbers require supporting methodological context for verifiability. In the revised manuscript we will (a) augment the abstract with brief statements on run counts and error-bar reporting, and (b) ensure the Evaluation section supplies complete baseline descriptions, workload characteristics, number of runs, and statistical error bars for all reported figures. These changes will allow readers to assess the strength of the claims.

revision: yes
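For the run counts and error bars promised here, a minimal reporting helper might look like the following. The throughput samples are invented, `summarize` is a hypothetical name, and the standard error of the mean is only one of several reasonable choices for an error bar.

```python
import math
import statistics


def summarize(runs):
    """Return (mean, standard error of the mean) for repeated measurements."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(runs)
    sem = statistics.stdev(runs) / math.sqrt(len(runs))
    return mean, sem


# Made-up throughput samples (tokens/s) from five hypothetical runs.
baseline_runs = [100.0, 102.0, 98.0, 101.0, 99.0]
mean, sem = summarize(baseline_runs)
report = f"{mean:.2f} ± {sem:.2f} tokens/s (n={len(baseline_runs)})"
```

A reported gain such as 5.15% is easier to assess when both the baseline and the FlexNPU figures come with a spread like this, since a gain smaller than the run-to-run variation would not be distinguishable from noise.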

Circularity Check

0 steps flagged

No circularity: empirical systems description with no derivations or fitted predictions

The paper presents an implementation of a virtualization layer for Ascend NPUs and reports empirical throughput and latency measurements on specific workloads. No equations, parameter fits, uniqueness theorems, or derivation chains appear in the abstract or described content. All claims reduce to direct measurements against baselines rather than any self-referential construction or renamed inputs. The central performance numbers are presented as observed outcomes of the implemented system, with no load-bearing self-citation or ansatz that collapses the result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems implementation paper. The abstract contains no mathematical derivations, fitted constants, or postulated entities.

pith-pipeline@v0.9.1-grok · 5945 in / 1208 out tokens · 32368 ms · 2026-06-28T04:55:03.877398+00:00 · methodology
