
arxiv: 2606.04415 · v1 · pith:STQ5BCYKnew · submitted 2026-06-03 · 💻 cs.DC

FlexNPU: Transparent NPU Virtualization for Dynamic LLM Prefill-Decode Co-location

Pith reviewed 2026-06-28 04:55 UTC · model grok-4.3

classification 💻 cs.DC
keywords NPU virtualization · LLM serving · prefill-decode co-location · Ascend NPU · transparent virtualization · dynamic scheduling · AI inference · phase-aware scheduling

The pith

FlexNPU interposes on AscendCL APIs to enable dynamic prefill-decode co-location on NPUs without modifying applications or drivers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FlexNPU as a user-space virtualization layer that routes NPU operations through per-device daemons. This boundary decouples applications from physical devices and permits phase-aware scheduling that adapts between the compute-intensive prefill stage and the memory-bound decode stage. A sympathetic reader would care because static disaggregation can create resource imbalance, while direct device exposure prevents runtime adaptation. If the approach holds, clusters can use the same hardware for both phases with lower first-token latency and without the extra data movement that disaggregation can introduce. The work reports these gains on Huawei Ascend NPUs, with a 384-card Ascend 910C deployment for DeepSeek-R1 and a separate Qwen2.5-7B evaluation.

Core claim

FlexNPU is a transparent virtualization layer that interposes on AscendCL APIs and routes operations through per-device daemons, decoupling unmodified applications from physical NPU devices. The resulting runtime boundary supports virtualized objects, controlled operator dispatch, and dynamic PD co-location that adapts scheduling to the complementary characteristics of prefill and decode phases. On a 384-card Ascend 910C deployment of DeepSeek-R1, the system improves throughput over static PD disaggregation by 5.15% and 26.33%; the abstract does not say which configurations the two figures correspond to. On Qwen2.5-7B, it keeps throughput comparable to static PD co-location while cutting time to first token (TTFT) by more than 92%, with nearly unchanged time per output token (TPOT).
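
The TTFT and TPOT figures quoted above are conventionally derived from per-token timestamps. The following is a minimal Python sketch under the usual definitions (TTFT as the time from request arrival to the first output token, TPOT as the mean gap between later output tokens). The function names are illustrative assumptions; the paper's own measurement code is not described in this summary.

```python
def ttft(arrival_s: float, token_times_s: list[float]) -> float:
    """Time to first token: first output token time minus request arrival."""
    return token_times_s[0] - arrival_s


def tpot(token_times_s: list[float]) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    return (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)


def relative_change(baseline: float, candidate: float) -> float:
    """Signed relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0
```

Under these definitions, a "more than 92% TTFT reduction" means `relative_change(baseline_ttft, flexnpu_ttft)` is below -92, and a throughput lift of 5.15% means the same function returns 5.15 when applied to throughput.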

What carries the argument

Transparent interposition on AscendCL APIs routed through per-device daemons, which virtualizes NPU objects and controls operator dispatch for phase-aware scheduling.
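
To make the interposition idea concrete, here is a toy Python model of a shim that hands the application virtual handles and forwards every operation to a per-device daemon. All class and method names (`DeviceDaemon`, `InterposedRuntime`, `remap`, and so on) are hypothetical; none of them is an AscendCL call or part of FlexNPU, and the real system interposes on a native library rather than on Python objects.

```python
import itertools


class DeviceDaemon:
    """Hypothetical per-device daemon: the only component touching the device."""

    def __init__(self, physical_id: int):
        self.physical_id = physical_id
        self.dispatched = []          # operations in the order they were run
        self._addr = itertools.count(0x1000, 0x100)

    def run(self, op: str, *args):
        self.dispatched.append((op, args))
        if op == "malloc":
            return next(self._addr)   # stand-in for a physical device address
        return None


class InterposedRuntime:
    """Hypothetical shim with the same call shape the application already uses."""

    def __init__(self, daemons: dict[int, DeviceDaemon]):
        self._daemons = daemons
        self._vdev_to_pdev = {pid: pid for pid in daemons}  # identity by default
        self._current_vdev = next(iter(daemons))
        self._handles = {}            # virtual handle -> (physical id, address)
        self._next_handle = itertools.count(1)

    def remap(self, vdev: int, pdev: int) -> None:
        """Runtime-side decision; the application is not involved."""
        self._vdev_to_pdev[vdev] = pdev

    def set_device(self, vdev: int) -> None:
        self._current_vdev = vdev

    def malloc(self, size: int) -> int:
        pdev = self._vdev_to_pdev[self._current_vdev]
        addr = self._daemons[pdev].run("malloc", size)
        handle = next(self._next_handle)
        self._handles[handle] = (pdev, addr)
        return handle                 # the application only ever sees this

    def launch(self, op_name: str, handle: int) -> None:
        pdev, addr = self._handles[handle]
        self._daemons[pdev].run("launch", op_name, addr)
```

Because the application holds only virtual handles, the runtime can change which daemon serves a virtual device without the application noticing. That property is what the summary calls decoupling.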

If this is right

  • Dynamic PD co-location adapts scheduling to the complementary resource demands of prefill and decode without extra data movement.
  • The same hardware pool can serve both phases while preserving unmodified model code, frameworks, and NPU drivers.
  • Throughput rises over static disaggregation while TTFT drops sharply with little change to TPOT.
  • No measurable overhead appears relative to direct device passthrough in the tested workloads.
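
As an illustration of what a phase-aware dispatch decision could look like, the sketch below picks between prefill and decode work for the next slot on a shared device. This is a toy rule invented for this example, not FlexNPU's scheduling policy, which this summary does not describe; `Request`, `pick_phase`, and `ttft_budget` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival: float  # seconds, same clock as `now`


def pick_phase(now: float, prefill_q: deque, decode_q: deque,
               ttft_budget: float) -> str:
    """Return 'prefill', 'decode', or 'idle' for the next dispatch slot.

    Toy rule: decode runs by default to keep TPOT steady, but a waiting
    prefill request takes the slot once its queueing delay reaches the
    TTFT budget, or whenever there is no decode work.
    """
    if prefill_q and (not decode_q or now - prefill_q[0].arrival >= ttft_budget):
        return "prefill"
    if decode_q:
        return "decode"
    return "idle"
```

A static co-location scheme would fix this choice ahead of time. The point of a runtime boundary is that a rule like this can be evaluated at every dispatch.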

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same virtualization boundary could support other phase-aware policies such as mixed batching or priority preemption if the daemon layer is extended.
  • Clusters that currently maintain separate prefill and decode hardware pools might consolidate onto fewer devices if dynamic co-location scales reliably.
  • The approach would become more attractive for production if the daemon routing could be made portable across additional NPU vendors without new interposition code.

Load-bearing premise

Interposing on AscendCL APIs and routing through per-device daemons can be performed transparently with no measurable overhead or compatibility breakage for unmodified applications and drivers.

What would settle it

A side-by-side measurement on the same 384-card Ascend 910C cluster that shows either measurable inference slowdown or application breakage under FlexNPU compared with direct passthrough would falsify the transparency claim.
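
A side-by-side comparison of this kind can be sketched as a small timing harness. The Python below is a generic sketch, assuming the direct and interposed paths can each be wrapped as a zero-argument callable; `median_latency` and `overhead_percent` are hypothetical helper names, and the `clock` parameter exists only so the harness can be tested deterministically.

```python
import statistics
import time
from typing import Callable


def median_latency(call: Callable[[], None], repeats: int = 1000,
                   warmup: int = 100,
                   clock: Callable[[], float] = time.perf_counter) -> float:
    """Median wall-clock seconds per call, after discarding warm-up calls."""
    for _ in range(warmup):
        call()
    samples = []
    for _ in range(repeats):
        start = clock()
        call()
        samples.append(clock() - start)
    return statistics.median(samples)


def overhead_percent(direct_s: float, interposed_s: float) -> float:
    """Relative slowdown of the interposed path versus the direct path."""
    return (interposed_s - direct_s) / direct_s * 100.0
```

Calling `overhead_percent(median_latency(direct_call), median_latency(interposed_call))` on the same hardware would give one number to hold against the "no measurable overhead" claim. Application breakage has to be checked separately.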

Original abstract

Modern AI serving increasingly relies on NPUs for conventional inference and large language model serving. However, current NPU deployments commonly expose physical devices directly to applications, which limits runtime control over scheduling and makes it difficult to adapt execution to phase-level workload behavior. This limitation is particularly evident in LLM serving, where the prefill phase is compute-intensive while the decode phase is often constrained by memory bandwidth and KV-cache accesses. Static prefill-decode (PD) disaggregation reduces phase interference, but can introduce resource imbalance and unnecessary data movement. We present FlexNPU, a transparent user-space virtualization layer for Ascend NPUs. FlexNPU interposes on AscendCL APIs and routes NPU operations through per-device daemons, decoupling unmodified applications from physical NPU devices without modifying model code, AI frameworks, or NPU drivers. This runtime boundary allows FlexNPU to virtualize NPU objects, control operator dispatch, and support phase-aware scheduling for LLM serving. In particular, FlexNPU enables dynamic PD co-location, which adapts scheduling between prefill and decode according to their complementary resource characteristics. We implement FlexNPU on Huawei Ascend NPUs and evaluate it with typical LLM workloads. Compared with direct NPU passthrough, FlexNPU introduces no measurable inference overhead and slightly improves throughput in some scenarios. On a 384-card Ascend 910C deployment of DeepSeek-R1, FlexNPU improves throughput over static PD disaggregation by 5.15% and 26.33%. On Qwen2.5-7B, compared with static PD co-location, FlexNPU maintains comparable throughput while reducing TTFT by over 92% across tested workloads with nearly unchanged TPOT. These results show that transparent NPU virtualization is a practical substrate for efficient and responsive LLM serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces FlexNPU, a transparent user-space virtualization layer for Ascend NPUs that interposes on AscendCL APIs and routes NPU operations through per-device daemons. This decouples unmodified applications from physical devices without changes to model code, AI frameworks, or drivers, enabling phase-aware scheduling for dynamic prefill-decode co-location in LLM serving. The work claims zero measurable overhead versus direct passthrough and reports specific gains: 5.15% and 26.33% throughput improvement over static PD disaggregation on a 384-card Ascend 910C deployment of DeepSeek-R1, plus >92% TTFT reduction with comparable throughput on Qwen2.5-7B versus static co-location.

Significance. If the interposition layer truly incurs no measurable overhead and maintains full compatibility, the result demonstrates a practical runtime substrate for adaptive NPU scheduling that addresses phase interference and resource imbalance in LLM inference. The evaluation on production-scale models and a large 384-card cluster provides concrete evidence of deployability on real Ascend hardware.

major comments (2)
  1. [Abstract and Evaluation section] The claim that FlexNPU 'introduces no measurable inference overhead' is load-bearing for all reported gains, yet no microbenchmarks isolating interposition or daemon-routing latency are described, nor is there an enumeration of intercepted AscendCL calls or a compatibility matrix with unmodified drivers and frameworks.
  2. [Abstract] The headline results (5.15%/26.33% throughput lift on DeepSeek-R1; >92% TTFT reduction on Qwen2.5-7B) are presented without any description of experimental methodology, baseline details, workload characteristics, run counts, or error bars, preventing verification that the data support the claims.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief sentence clarifying the scope of 'unmodified applications' (e.g., whether it includes all common inference frameworks).
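
The run counts and error bars requested in the second major comment can be produced with very little machinery. The Python sketch below summarizes repeated runs as a mean with an approximate 95% confidence interval. It assumes a normal approximation (z = 1.96), which is a simplification for small run counts; the helper names are hypothetical, and no data from the paper is used.

```python
import math
import statistics


def mean_with_ci(samples: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean, half-width) of an approximate 95% confidence interval.

    Uses the normal approximation; for a handful of runs a Student-t
    critical value would be the more careful choice.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two (mean, half-width) intervals overlap."""
    return abs(a[0] - b[0]) <= a[1] + b[1]
```

If the intervals for the baseline and for FlexNPU overlap, a small reported lift such as 5.15% would be hard to distinguish from run-to-run noise, which is why the referee asks for these numbers.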

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight areas where additional detail will strengthen the manuscript. We address each point below and will incorporate the requested changes in the revised version.

  1. Referee: [Abstract and Evaluation section] The claim that FlexNPU 'introduces no measurable inference overhead' is load-bearing for all reported gains, yet no microbenchmarks isolating interposition or daemon-routing latency are described, nor is there an enumeration of intercepted AscendCL calls or a compatibility matrix with unmodified drivers and frameworks.

    Authors: We agree that the current version does not include dedicated microbenchmarks isolating interposition and daemon-routing latency, nor an explicit enumeration of intercepted AscendCL calls or a compatibility matrix. In the revision we will add a new subsection to the Evaluation section containing (1) microbenchmarks for the latency of the interposed AscendCL path versus direct passthrough, (2) a table listing all intercepted calls, and (3) a compatibility matrix covering the tested frameworks and drivers. These additions will directly support the zero-overhead claim with quantitative evidence.

    revision: yes

  2. Referee: [Abstract] The headline results (5.15%/26.33% throughput lift on DeepSeek-R1; >92% TTFT reduction on Qwen2.5-7B) are presented without any description of experimental methodology, baseline details, workload characteristics, run counts, or error bars, preventing verification that the data support the claims.

    Authors: The abstract is intentionally concise, but we acknowledge that the headline numbers require supporting methodological context for verifiability. In the revised manuscript we will (a) augment the abstract with brief statements on run counts and error-bar reporting, and (b) ensure the Evaluation section supplies complete baseline descriptions, workload characteristics, number of runs, and statistical error bars for all reported figures. These changes will allow readers to assess the strength of the claims.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems description with no derivations or fitted predictions


The paper presents an implementation of a virtualization layer for Ascend NPUs and reports empirical throughput and latency measurements on specific workloads. No equations, parameter fits, uniqueness theorems, or derivation chains appear in the abstract or described content. All claims reduce to direct measurements against baselines rather than any self-referential construction or renamed inputs. The central performance numbers are presented as observed outcomes of the implemented system, with no load-bearing self-citation or ansatz that collapses the result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems implementation paper. The abstract contains no mathematical derivations, fitted constants, or postulated entities.


