Efficient, VRAM-Constrained xLM Inference on Clients

Aditya Ukarande; Deep Shekhar; Marc Blackstein; Ram Rangan

arXiv: 2604.26334 · v1 · submitted 2026-04-29 · cs.DC · cs.AR · cs.LG

Pith reviewed 2026-05-07 13:21 UTC · model grok-4.3

classification cs.DC · cs.AR · cs.LG

keywords pipelined sharding · client inference · VRAM constrained · xLM · CPU offloading · hybrid scheduling · TTFT optimization · MoE inference

The pith

Pipelined sharding uses sub-layer model partitioning and CPU-GPU pipelining to achieve efficient lossless xLM inference on VRAM-limited client devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces pipelined sharding as a benchmark-profile-guided hybrid scheduling method for running high-accuracy large language and vision-language models on client hardware. It shards models at the sub-layer level, offloads tensors to CPU, pipelines memory copies with computation, and prioritizes VRAM placement to balance quick first-token response with high generation speed while staying within memory limits. For vision-language models it layers on vision-tensor offloading, flash attention, and overlap avoidance. These steps target interactive and batched workloads on devices such as those running future releases of NVIDIA's IGI SDK and CR1 model. The resulting speedups and memory savings are measured against aggressive baselines across multiple models and client systems.

Core claim

Pipelined sharding shards the model at the sub-layer level and combines this with CPU offloading, pipelined copy-compute overlap, and prioritized tensor placement in VRAM. This hybrid CPU-GPU scheduler, guided by benchmark profiles, optimizes both time-to-first-token (TTFT) and tokens-per-second (TPS) for dense and MoE LLMs under VRAM constraints while adapting to system and workload conditions. When augmented with vision tensor CPU offloading, flash attention, and VRAM overlap avoidance, the same approach yields low-memory VLM inference. Evaluation shows TTFT gains up to 6.7x and TPS gains up to 30x for interactive LLM use, 10x lower VRAM for CR1, and 8.2x higher batched throughput, all without accuracy loss.
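To make the copy-compute overlap concrete, here is a minimal timing sketch. It is not the paper's scheduler: it assumes one copy engine, one compute stream, and enough staging buffers that a copy never waits for a buffer to free up. The function names and the per-shard millisecond figures are made up for illustration.

```python
from typing import Sequence


def serial_time(copy_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Each shard is copied and then computed, with no overlap at all."""
    if len(copy_ms) != len(compute_ms):
        raise ValueError("need one copy time and one compute time per shard")
    return sum(copy_ms) + sum(compute_ms)


def pipelined_time(copy_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Copies run back to back; shard i computes once its copy has landed
    and shard i-1 has finished computing (assumed two-stage pipeline)."""
    if len(copy_ms) != len(compute_ms):
        raise ValueError("need one copy time and one compute time per shard")
    copy_done = 0.0
    compute_done = 0.0
    for copy, compute in zip(copy_ms, compute_ms):
        copy_done += copy
        compute_done = max(compute_done, copy_done) + compute
    return compute_done


# Illustrative numbers only: three shards, copy-bound case.
copies = [4.0, 4.0, 4.0]
computes = [3.0, 3.0, 3.0]
print(serial_time(copies, computes), pipelined_time(copies, computes))
```

In this toy case the pipelined schedule hides all but the last shard's compute behind the copies. When compute dominates instead, only the first copy stays exposed.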

What carries the argument

Pipelined sharding: sub-layer model sharding combined with CPU offloading, pipelined copy-compute, and prioritized VRAM tensor placement, guided by benchmark profiles to optimize TTFT and TPS under memory limits.
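One simple way to picture prioritized VRAM placement is a greedy fill of the budget in priority order. The sketch below is a generic greedy heuristic under that assumption, not the placement policy from the paper. The `Tensor` type, the priority scores, and the tensor names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Tensor:
    name: str
    size_mb: int
    priority: float  # hypothetical score: higher means pinning helps more


def place_tensors(
    tensors: Sequence[Tensor], vram_budget_mb: int
) -> Tuple[List[str], List[str]]:
    """Pin tensors in VRAM in descending priority until the budget is spent;
    everything that does not fit stays in system RAM."""
    pinned: List[str] = []
    offloaded: List[str] = []
    remaining = vram_budget_mb
    for t in sorted(tensors, key=lambda t: t.priority, reverse=True):
        if t.size_mb <= remaining:
            pinned.append(t.name)
            remaining -= t.size_mb
        else:
            offloaded.append(t.name)
    return pinned, offloaded


# Illustrative sub-layer tensors and scores (all invented for this example).
model = [
    Tensor("attn_qkv", 300, 3.0),
    Tensor("ffn_up", 900, 2.0),
    Tensor("ffn_down", 900, 1.5),
    Tensor("kv_cache", 500, 2.5),
]
print(place_tensors(model, vram_budget_mb=1000))
```

A greedy pass like this is easy to reason about but can leave budget unused when a large high-priority tensor does not fit. A real scheduler may weigh size against benefit differently.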

If this is right

Interactive LLM inference achieves up to 6.7x lower TTFT and 30x higher TPS.
CR1 VLM inference runs with 10x lower VRAM demand.
Batched workloads obtain up to 8.2x higher throughput.
The scheduler adapts automatically to different system memory sizes and inference modes.
Lossless accuracy is preserved while meeting client VRAM budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The profile-guided decisions could be extended to on-device continuous learning that refreshes schedules from recent user workloads.
Combining pipelined sharding with emerging unified-memory hardware might further reduce explicit copy overhead.
The same sub-layer granularity could apply to other memory-bound client tasks such as real-time multimodal generation.

Load-bearing premise

That benchmark-profile-guided CPU-GPU hybrid scheduling and offloading will deliver consistent performance gains without unacceptable overhead or accuracy loss on client hardware and model variants outside the tested set.

What would settle it

Measure TTFT, TPS, and VRAM usage for the same models on untested client configurations, such as AMD GPUs or systems with substantially different CPU-GPU bandwidth. Any configuration that shows no speedup, or that adds latency overhead, while output quality is unchanged would falsify the claim of reliable gains.
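A replication would need to fix how the two latency metrics are computed. The sketch below uses one common convention, which is an assumption here and may differ from the paper's measurement harness: TTFT is the delay from request submission to the first generated token, and TPS is the decode rate over the tokens after the first. The function name and the timestamps are illustrative.

```python
from typing import Sequence, Tuple


def ttft_and_tps(
    request_start_s: float, token_times_s: Sequence[float]
) -> Tuple[float, float]:
    """Return (TTFT in seconds, decode tokens per second).

    token_times_s holds the wall-clock arrival time of each generated token.
    TPS excludes the first token, so prefill time does not dilute it.
    """
    if not token_times_s:
        raise ValueError("at least one generated token is required")
    ttft = token_times_s[0] - request_start_s
    decode_span = token_times_s[-1] - token_times_s[0]
    decode_tokens = len(token_times_s) - 1
    tps = decode_tokens / decode_span if decode_span > 0 else float("nan")
    return ttft, tps


# Made-up timestamps: first token after 0.5 s, then one token every 0.25 s.
print(ttft_and_tps(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

Reporting the two numbers separately matters because a configuration can improve one while regressing the other.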

Figures

Figures reproduced from arXiv: 2604.26334 by Aditya Ukarande, Deep Shekhar, Marc Blackstein, Ram Rangan.

**Figure 1.** Pipelined Sharding Overview. In Step 1, which runs at install time, we populate a profile database by running a collection of kernels on the CPU and the GPU. In Step 2, which runs in the planning phase, the scheduler will pin some sub-layers based on priority and generate three possible schedule plans for the remaining sub-layers, namely, a) all of them reside in sysRAM and execute on GPU, b) some of the…

**Figure 2.** Speedups from pipelined sharding for LLMs on cli3, relative to llama-cpp-baseline. Each chart shows speedups across eight VRAM budgets (2G to 32G) for four models at four context sizes (1K, 4K, 16K, 64K). (1) Time-to-first-token (TTFT) speedups average 2× (max 6.7×). (2) Tokens-per-second (TPS) speedups average 3.7×, reaching up to 30× for qwen235b at 64K context (truncated values labeled above). (3) End-t…

**Figure 3.** TTFT and TPS speedups from pipelined sharding vs llama.cpp's manual CPU offloading for qwen30b on cli3 across context sizes and VRAM budgets. Values >7× are labeled.

**Figure 4.** Scheduling decisions for nemo8b and qwen30b across CPU thread counts (2t and 8t), context sizes (4K and 16K), and VRAM budgets (2G, 4G, and 8G). With fewer threads (2t), the scheduler predominantly selects GPU-only execution. With more threads (8t), it shifts to Static or Dynamic plans that leverage CPU resources, demonstrating adaptivity to system conditions.

**Figure 5.** Sensitivity studies on cli3. (a) TPS vs CPU thread count at 8G VRAM and 16K context. (b) TPS and (c) TTFT vs. PCIe generation at 16K context and 8G VRAM with 16 CPU threads. Lower TTFT is better. …ment (13-17%) as intermediate outputs transfer over PCIe, but minimal TPS variation. Pipelined sharding's TPS also shows modest gains (4-6%) as the scheduler selected a Static plan, entailing minimal PCIe activity…

**Figure 7.** Pipelined sharding batched mode performance on cli3 across batch sizes (4, 16, 64), context sizes (1K and 4K), and VRAM budgets (4G, 8G, 16G). Batch-wide TPS improves by 2.3× on average and up to 8.2×. With 1K context, TPS generally scales with batch size up to 64, with qwen30b reaching 289 TPS (ukv). For 4K, for qwen30b at all budgets and nemo8b at 16G, TPS drops with larger batch sizes at the point where…
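The Figure 1 caption describes a profile database filled at install time and a planning step that compares candidate schedules. The sketch below shows the general shape of such a profile-guided choice for a single sub-layer. Everything in it is assumed for illustration: the `PROFILE_DB` contents, the two option names, and the cost model, which ignores copy-compute overlap. Because the caption is truncated in this extract, the CPU option is an illustrative stand-in and not a reconstruction of the paper's remaining plans.

```python
from typing import Dict

# Hypothetical profile database: kernel name -> measured milliseconds per device.
PROFILE_DB: Dict[str, Dict[str, float]] = {
    "attn": {"cpu_ms": 12.0, "gpu_ms": 1.0},
    "ffn": {"cpu_ms": 6.0, "gpu_ms": 2.0},
}


def estimate_options(kernel: str, weight_mb: float, link_gb_per_s: float) -> Dict[str, float]:
    """Estimate per-invocation latency in ms for two illustrative options."""
    profile = PROFILE_DB[kernel]
    # With decimal units, MB divided by GB/s gives milliseconds directly.
    copy_ms = weight_mb / link_gb_per_s
    return {
        "copy_then_gpu": copy_ms + profile["gpu_ms"],  # weights live in sysRAM
        "run_on_cpu": profile["cpu_ms"],               # no transfer needed
    }


def choose_option(kernel: str, weight_mb: float, link_gb_per_s: float) -> str:
    """Pick the option with the lowest estimated latency."""
    options = estimate_options(kernel, weight_mb, link_gb_per_s)
    return min(options, key=options.get)


# Illustrative inputs: a 200 MB FFN shard and a 16 MB attention shard at 25 GB/s.
print(choose_option("ffn", 200.0, 25.0), choose_option("attn", 16.0, 25.0))
```

The point of the sketch is that the decision flips with weight size and link bandwidth, which is the kind of system dependence that Figures 4 and 5 report.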

Original abstract

To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama$.$cpp implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products - the In-Game Inferencing software development kit (IGI SDK) and the Cosmos-Reason1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: for interactive use, TTFT improves by up to 6.7x and TPS by up to 30x for LLMs, and CR1 inference's VRAM demand is down by 10x, while in batched mode, throughput improves by up to 8.2x, all compared to their respective aggressive baselines. This paper is accepted at the 9th MLSys Conference (Industry Track), 2026. Code and artifact available at: https://github.com/deepshnv/pipeshard-mlsys26-ae

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Pipelined sharding is a practical CPU-GPU hybrid scheduler that delivers large reported gains in TTFT, TPS, and VRAM use for client xLM inference, but the gains look tied to profiled hardware and need more detail on baselines and overhead.

The core contribution is pipelined sharding: sub-layer model sharding plus prioritized VRAM placement, CPU offloading, and pipelined copy-compute, all driven by benchmark profiles. They pair it with VLMOpt (vision tensor offload, flash attention, and VRAM overlap avoidance) for vision-language models. The abstract claims this runs high-accuracy LLMs and VLMs on VRAM-constrained clients, with up to 6.7x better TTFT, 30x TPS for interactive LLM use, 8.2x throughput in batch mode, and 10x lower VRAM for CR1, all versus aggressive baselines. Code is released and the work is accepted at MLSys 2026 industry track, aimed at NVIDIA's IGI SDK and Cosmos-Reason1.

Referee Report

2 major / 2 minor

Summary. The paper proposes pipelined sharding, a benchmark-profile-guided CPU-GPU hybrid scheduling technique for VRAM-constrained inference of dense and MoE LLMs and VLMs on client systems. It combines sub-layer model sharding, CPU offloading, pipelined copy-compute overlap, and prioritized tensor placement in VRAM to jointly optimize TTFT and TPS while adapting to system conditions. For VLMs, it augments this with VLMOpt (vision tensor CPU offloading, flash attention, and vision-language VRAM overlap avoidance) implemented in llama.cpp. Evaluation across multiple models and client systems reports up to 6.7x TTFT and 30x TPS gains for interactive LLM use, 10x VRAM reduction for CR1 inference, and 8.2x batched throughput improvement versus aggressive baselines. The work targets future releases of NVIDIA's IGI SDK and Cosmos-Reason1 (CR1) VLM, with code and artifacts released.

Significance. If the empirical results hold under scrutiny, the work offers a practical advance for client-side xLM inference by enabling high-accuracy models under tight VRAM limits without accuracy loss. The combination of sharding, offloading, and pipelining addresses real deployment constraints on heterogeneous client hardware, with direct relevance to industry products. Strengths include the multi-model/system evaluation and public artifact release; the profile-guided adaptation is a pragmatic engineering contribution rather than a purely theoretical one.

major comments (2)

[Evaluation] Evaluation section: the central performance claims (6.7x TTFT, 30x TPS, 10x VRAM, 8.2x throughput) rest on comparisons to 'aggressive baselines' whose exact definitions, implementation details, and selection criteria are not fully specified. Without this, it is impossible to verify whether the reported gains are robust or sensitive to baseline choice.
[Pipelined Sharding] Pipelined sharding and scheduling description: the benchmark-profile-guided CPU-GPU hybrid decisions are presented as flexibly adapting to system and inference conditions, yet no data or analysis is given on profiling overhead, the cost of re-profiling when hardware changes, or whether the resulting schedule remains near-optimal for CPU-GPU interconnects, VRAM sizes, or model variants outside the evaluated set. This directly affects the generalizability of the lossless, low-overhead claims.

minor comments (2)

[Abstract] Abstract: 'llama$.$cpp' appears to be a formatting artifact for 'llama.cpp'; correct for clarity.
[Evaluation] The paper would benefit from an explicit table or figure summarizing the exact hardware configurations, model sizes, and baseline implementations used in each reported speedup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the major comments point by point below and will update the manuscript to incorporate the requested clarifications.

Referee: [Evaluation] Evaluation section: the central performance claims (6.7x TTFT, 30x TPS, 10x VRAM, 8.2x throughput) rest on comparisons to 'aggressive baselines' whose exact definitions, implementation details, and selection criteria are not fully specified. Without this, it is impossible to verify whether the reported gains are robust or sensitive to baseline choice.

Authors: We agree that the aggressive baselines require more explicit definition for reproducibility and to allow verification of the gains. In the revised manuscript we will expand the evaluation section with precise descriptions of each baseline, including their exact sharding levels, offloading policies, and selection criteria as the most aggressive feasible configurations under the same VRAM constraints.

revision: yes

Referee: [Pipelined Sharding] Pipelined sharding and scheduling description: the benchmark-profile-guided CPU-GPU hybrid decisions are presented as flexibly adapting to system and inference conditions, yet no data or analysis is given on profiling overhead, the cost of re-profiling when hardware changes, or whether the resulting schedule remains near-optimal for CPU-GPU interconnects, VRAM sizes, or model variants outside the evaluated set. This directly affects the generalizability of the lossless, low-overhead claims.

Authors: We acknowledge that empirical data on profiling overhead and re-profiling cost would strengthen the low-overhead claims. We will add measurements of profiling time relative to inference latency and a short analysis showing re-profiling is performed only on hardware changes and remains negligible. For generalizability we will insert a limitations paragraph noting that while the evaluated systems cover a range of client VRAM sizes and interconnects, schedules for untested hardware or model variants may differ; the profile-guided scheduler is intended to adapt automatically, but we cannot claim optimality outside the tested scope without further experiments.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmarks with no derivation chain

The paper introduces pipelined sharding (sub-layer sharding + CPU offloading + pipelined copy-compute + prioritized VRAM placement) and VLMOpt as engineering techniques, then reports measured speedups and VRAM reductions versus baselines on specific models and client hardware. No equations, fitted parameters, predictions derived from inputs, or self-citation load-bearing steps appear in the abstract or description. Performance claims rest on direct experimental comparison rather than any reduction to the technique's own definitions or prior self-citations. This is the common case of a systems paper whose central contribution is implementation and measurement, not a closed mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As an applied systems paper focused on engineering optimizations, the central claims rest on standard assumptions about CPU-GPU memory hierarchies and scheduling overheads rather than new theoretical constructs.

axioms (1)

Domain assumption: typical client CPU-GPU interconnect and memory bandwidth behaviors allow effective pipelining of copy and compute operations.
The technique depends on hardware characteristics that are assumed to hold for the tested NVIDIA client systems.
