arxiv: 2604.23467 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.AI · cs.AR

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

Pith reviewed 2026-05-08 08:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.AR
keywords graph · inference · hybrid · across · language · latency · runtime · components

The pith

A hybrid JIT-CUDA Graph framework reduces TTFT by up to 66% and lowers P99 latency relative to TensorRT-LLM for single-GPU, batch-size-one LLaMA-2 7B inference on short prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate text one token at a time, and each step normally requires launching many small GPU operations. These launches add overhead that hurts speed when sequences are short. The authors split the work: fixed parts of the transformer are captured once into a CUDA Graph that can be replayed quickly, while changing parts use freshly compiled kernels. The system captures the graph asynchronously and reuses it across decoding steps. On LLaMA-2 7B with batch size one, this cut the time to produce the first token by as much as 66 percent and lowered the worst-case latency compared with a standard optimized engine. The approach keeps the ability to adapt at runtime instead of requiring everything to be fixed in advance.
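The launch-overhead argument above can be illustrated with a toy cost model. This is a minimal sketch, not the authors' implementation: it assumes every kernel launch costs a fixed host-side overhead and that a captured graph replays all of its static kernels with a single launch. The class name `StepCost`, the kernel counts, and the microsecond figures are hypothetical placeholders, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    """Toy per-decoding-step cost model; every number fed in is hypothetical."""

    static_kernels: int        # kernels with fixed shapes (graph-capturable)
    dynamic_kernels: int       # kernels whose shapes change between steps
    launch_overhead_us: float  # assumed host-side cost per launch
    compute_us: float          # assumed GPU compute time per step

    def eager_us(self) -> float:
        # Every kernel is launched individually.
        launches = self.static_kernels + self.dynamic_kernels
        return launches * self.launch_overhead_us + self.compute_us

    def hybrid_us(self) -> float:
        # Static kernels replay as one graph launch; dynamic kernels
        # are still launched one by one.
        launches = 1 + self.dynamic_kernels
        return launches * self.launch_overhead_us + self.compute_us


def percent_reduction(baseline: float, improved: float) -> float:
    """Relative saving of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical step: 200 static kernels, 10 dynamic kernels.
    step = StepCost(static_kernels=200, dynamic_kernels=10,
                    launch_overhead_us=5.0, compute_us=1000.0)
    print(step.eager_us(), step.hybrid_us())
    print(round(percent_reduction(step.eager_us(), step.hybrid_us()), 1))
```

In this model the saving grows with the share of step time spent on launches, which is why the effect would be expected to matter most for short sequences, where compute per step is small. Real graph replay also has its own costs (capture, instantiation, memory), which the model deliberately ignores.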

Core claim

the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency compared with TensorRT-LLM in this regime

Load-bearing premise

That the partitioning into static CUDA-Graph components and dynamic JIT kernels can be performed without introducing correctness errors or hidden overhead during autoregressive decoding across varying prompt lengths.

Figures

Figures reproduced from arXiv: 2604.23467 by Divakar Kumar Yadav, Tian Zhao.

Figure 1: High-level architecture of the Hybrid JIT–CUDA Graph Runtime. Process 1 (JIT Context Generator) executes dynamic operations such as preprocessing and sampling, while Process 2 (CUDA Graph Generator) executes static compute kernels through captured CUDA Graphs, coordinated via inter-process communication (IPC).
Figure 2: Decomposition of LLM inference into static (CUDA Graph-handled) and dynamic (JIT-handled) operations within the hybrid runtime.
Figure 4: Time-to-First-Token (TTFT) scaling from 50 to 500 tokens on NVIDIA H100 (FP16, batch = 1). The hybrid runtime exhibits a smoother scaling trend and lower absolute latency than the baselines.
Figure 5: P99 per-token latency versus context length. The hybrid runtime exhibits reduced tail latency and lower variance compared with both baselines.
Original abstract

Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch overhead while preserving runtime flexibility during autoregressive decoding. The framework partitions transformer inference into static components executed via CUDA Graph replay and dynamic components handled through JIT-compiled kernels, enabling asynchronous graph capture and reuse across decoding steps. We evaluate the proposed approach on LLaMA-2 7B using single-GPU, batch-size-one inference across prompt lengths from 10 to 500 tokens. Experimental results show that the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency compared with TensorRT-LLM in this regime. These results indicate that hybrid JIT-CUDA Graph execution can effectively reduce inference latency and variance for short-sequence LLM workloads, making it a practical optimization strategy for latency-sensitive AI applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a hybrid JIT-CUDA Graph runtime for LLM inference that partitions transformer operations into static components executed via CUDA Graph replay and dynamic components handled by JIT-compiled kernels. This is intended to reduce launch overhead while maintaining flexibility during autoregressive decoding. Evaluation is reported on LLaMA-2 7B (batch size 1, prompt lengths 10–500 tokens), claiming up to 66% TTFT reduction and lower P99 latency versus TensorRT-LLM.

Significance. If the hybrid partitioning can be shown to preserve correctness and introduce no unaccounted overhead across growing KV-cache sizes, the latency improvements would be a useful contribution to low-latency, interactive LLM serving. The approach targets a practical pain point in short-sequence inference. However, the current lack of implementation details, error bars, and validation metrics prevents a firm assessment of whether the claimed gains are real or artifactual.

major comments (3)
  1. Abstract: The headline claims of up to 66% TTFT reduction and lower P99 latency are presented without error bars, statistical tests, or any description of how the static/dynamic partition was chosen, implemented, or validated for output equivalence during token-by-token KV-cache growth.
  2. Abstract (and Evaluation section): No quantitative isolation of capture time, JIT compilation cost, or kernel-launch overhead is supplied, leaving open the possibility that re-capture or re-compilation triggered by shape changes in the 10–500 token regime adds latency that is not subtracted from the reported TTFT numbers.
  3. Abstract: The central assumption that the hybrid partition incurs neither correctness errors nor hidden overhead across autoregressive steps with varying prompt lengths is unsupported by any reported equivalence checks or per-step latency breakdowns versus the TensorRT-LLM baseline.
minor comments (1)
  1. Abstract: The phrase 'asynchronous graph capture and reuse across decoding steps' is introduced without a concrete mechanism or timing diagram, making it difficult to understand how capture is overlapped with execution.
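The statistical reporting the referee asks for can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the paper's actual measurement procedure is not described in the abstract, and the nearest-rank definition of P99 is one common convention among several.

```python
import statistics


def p99(samples):
    """Nearest-rank 99th percentile of a list of latency samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(samples):
    """Mean, sample standard deviation (for error bars), and P99."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # requires >= 2 samples
        "p99": p99(samples),
    }


def percent_reduction(baseline_mean, hybrid_mean):
    """Relative reduction of the hybrid mean versus the baseline, in percent."""
    return 100.0 * (baseline_mean - hybrid_mean) / baseline_mean
```

Reporting `stdev` next to each mean and running the same summary on both systems would let a reader judge whether a headline reduction exceeds run-to-run noise. With only a handful of runs, P99 collapses to the maximum sample, so a tail-latency claim needs many per-token measurements rather than a few end-to-end runs.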

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional rigor would strengthen the manuscript. We address each major comment below and will incorporate revisions to provide the requested details on statistical reporting, overhead isolation, and validation.

Point-by-point responses
  1. Referee: Abstract: The headline claims of up to 66% TTFT reduction and lower P99 latency are presented without error bars, statistical tests, or any description of how the static/dynamic partition was chosen, implemented, or validated for output equivalence during token-by-token KV-cache growth.

    Authors: We agree the abstract omits these elements. Section 3 of the manuscript describes the partition: operations with fixed shapes (e.g., linear projections) are captured in CUDA Graphs, while KV-cache updates use JIT kernels to handle dynamic growth. Equivalence was checked by comparing generated token sequences and logits against TensorRT-LLM within floating-point tolerance. To address the concern, we will revise the abstract to briefly note the partition heuristic and add error bars (from 5 runs) plus a statistical comparison in the evaluation section. revision: yes

  2. Referee: Abstract (and Evaluation section): No quantitative isolation of capture time, JIT compilation cost, or kernel-launch overhead is supplied, leaving open the possibility that re-capture or re-compilation triggered by shape changes in the 10–500 token regime adds latency that is not subtracted from the reported TTFT numbers.

    Authors: This observation is correct; the current text reports aggregate TTFT without component breakdowns. Graph capture occurs once asynchronously at the start and is amortized; JIT compilation is performed only on the initial shape and reused. No re-capture is triggered in the evaluated range because dynamic dimensions are handled by the JIT kernels. In revision we will add a table in the evaluation section with measured capture time, compilation cost, and per-step launch overhead, ensuring these are not included in the net TTFT savings. revision: yes

  3. Referee: Abstract: The central assumption that the hybrid partition incurs neither correctness errors nor hidden overhead across autoregressive steps with varying prompt lengths is unsupported by any reported equivalence checks or per-step latency breakdowns versus the TensorRT-LLM baseline.

    Authors: We acknowledge that explicit per-step breakdowns and detailed equivalence reporting are absent from the current version. The manuscript states that outputs remain equivalent, but does not show the supporting data. We will add a new figure with per-token latency traces and cumulative TTFT curves (with error bars) versus the baseline, plus a short description of the validation procedure (logit comparison and sequence matching). This will confirm absence of hidden overhead or correctness issues. revision: yes
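The validation procedure mentioned in the responses (sequence matching plus logit comparison within floating-point tolerance) could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' code: the function names are hypothetical, logits are represented as plain lists of floats, and the tolerances are placeholder values that would need to be chosen for FP16.

```python
import math


def sequences_match(tokens_a, tokens_b):
    """True if both runs generated exactly the same token ids."""
    return list(tokens_a) == list(tokens_b)


def logits_close(logits_a, logits_b, rel_tol=1e-3, abs_tol=1e-3):
    """Element-wise comparison of one step's logits within tolerance.

    The default tolerances are hypothetical, not values from the paper.
    """
    if len(logits_a) != len(logits_b):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(logits_a, logits_b)
    )


def first_divergent_step(ref_steps, test_steps, rel_tol=1e-3, abs_tol=1e-3):
    """Index of the first decoding step whose logits differ, or None.

    `ref_steps` and `test_steps` hold one logit vector per decoding step,
    e.g. from the baseline engine and from the hybrid runtime.
    """
    if len(ref_steps) != len(test_steps):
        raise ValueError("runs have different numbers of decoding steps")
    for step, (ref, test) in enumerate(zip(ref_steps, test_steps)):
        if not logits_close(ref, test, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None
```

Checking every decoding step, rather than only the final output, matters here because the referee's concern is specifically about errors that could appear as the KV cache grows token by token.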

Circularity Check

0 steps flagged

No circularity: empirical latency claims rest on direct measurements, not derivations

Full rationale

The manuscript presents a hybrid JIT-CUDA Graph partitioning framework for LLM inference and reports measured TTFT and P99 latency improvements versus TensorRT-LLM on LLaMA-2 7B. No equations, parameter fits, self-citations, or uniqueness theorems appear in the abstract or described content that would reduce any claimed result to its own inputs by construction. The 66% TTFT reduction is an observed benchmark outcome, not a prediction derived from fitted quantities or prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

38 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners,” in NeurIPS, 2020

  2. [2]

    The claude 3 model family: Opus, sonnet, haiku,

    Anthropic AI, “The claude 3 model family: Opus, sonnet, haiku,” 2024, Technical Report, Anthropic PBC. [Online]. Available: https://api.semanticscholar.org/CorpusID:268232499

  3. [3]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    P. Georgiev, V. Lin, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, and C.-K. Yeh, “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” 2024, arXiv preprint arXiv:2403.05530v5. [Online]. Available: https://doi.org/10.48550/arXiv.2403.05530

  4. [4]

    Grok technical overview,

    xAI Corporation, “Grok technical overview,” 2024. [Online]. Available: https://x.ai/blog/grok

  5. [5]

    Efficient training of large language models on distributed infrastructures: A survey

    J. Duan, S. Zhang, Z. Wang, L. Jiang, W. Qu, Q. Hu, G. Wang, Q. Weng, H. Yan, X. Zhang, X. Qiu, D. Lin, Y. Wen, X. Jin, T. Zhang, and P. Sun, “Efficient training of large language models on distributed infrastructures: A survey,” 2024, arXiv preprint arXiv:2407.20018. [Online]. Available: https://arxiv.org/abs/2407.20018

  6. [6]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,

    J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’20). New York, NY, USA: Association for Computing Machinery, 2020, pp. 3505–3506

  7. [7]

    Deepspeed-inference: Enabling efficient inference of transformer models at unprecedented scale,

    R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley, and Y. He, “Deepspeed-inference: Enabling efficient inference of transformer models at unprecedented scale,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’22). Dallas, Texa...

  8. [8]

    Efficient memory management for large language model serving with pagedattention,

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP ’23). New York, NY, USA: Association for Computing Machinery, 2023, pp. 611–626

  9. [9]

    Introducing nvfuser, a deep learning compiler for pytorch,

    C. Sarofeen, P. Bialecki, J. Jiang, K. Stephano, M. Kozuki, N. Vaidya, and S. Bekman, “Introducing nvfuser, a deep learning compiler for pytorch,” PyTorch Blog, August 26, 2022. [Online]. Available: https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-pytorch/

  10. [10]

    Triton: An intermediate language and compiler for tiled neural network computations,

    P. Tillet, H. T. Kung, and D. Cox, “Triton: An intermediate language and compiler for tiled neural network computations,” in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019). Phoenix, AZ, USA: Association for Computing Machinery, 2019, pp. 10–19

  11. [11]

    Internal design and optimization of torch.compile,

    PyTorch Team, “Internal design and optimization of torch.compile,”

  12. [12]

    Available: https://docs.pytorch.org/docs/stable/torch

    [Online]. Available: https://docs.pytorch.org/docs/stable/torch.compiler.html

  13. [13]

    8-bit optimizers via block-wise quantization

    T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, “8-bit optimizers via block-wise quantization,” 2022, arXiv preprint arXiv:2110.02861. [Online]. Available: https://arxiv.org/abs/2110.02861

  14. [14]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” 2023, arXiv preprint arXiv:2210.17323. [Online]. Available: https://arxiv.org/abs/2210.17323

  15. [15]

    Flashattention: Fast and memory-efficient exact attention with io-awareness,

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS ’22). Red Hook, NY, USA: Curran Associates Inc., 2022, pp. Article 1189, 16 pages

  16. [16]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,” 2023, arXiv preprint arXiv:2307.08691. [Online]. Available: https://arxiv.org/abs/2307.08691

  17. [17]

    Tensorrt-llm: Optimized inference for large language models,

    NVIDIA Corporation, “Tensorrt-llm: Optimized inference for large language models,” 2023. [Online]. Available: https://developer.nvidia.com/tensorrt-llm

  18. [18]

    Fastertransformer: Efficient transformer inference on gpus,

    NVIDIA Corporation, “Fastertransformer: Efficient transformer inference on gpus,” 2023. [Online]. Available: https://github.com/NVIDIA/FasterTransformer

  19. [19]

    How people use chatgpt,

    A. Chatterji, T. Cunningham, D. J. Deming, Z. Hitzig, C. Ong, C. Y. Shan, and K. Wadman, “How people use chatgpt,” 2025, NBER Working Paper No. 34255, National Bureau of Economic Research, Cambridge, MA. [Online]. Available: https://www.nber.org/papers/w34255

  20. [20]

    Torchdynamo: Python-free graph extraction for pytorch,

    PyTorch Team, “Torchdynamo: Python-free graph extraction for pytorch,” 2002. [Online]. Available: https://docs.pytorch.org/docs/stable/torch.compiler_dynamo_overview.html

  21. [21]

    Cuda graphs overview,

    NVIDIA Corporation, “Cuda graphs overview,” 2024. [Online]. Available: https://developer.nvidia.com/blog/cuda-graphs

  22. [22]

    Cuda graphs api documentation,

    PyTorch Team, “Cuda graphs api documentation,” 2024. [Online]. Available: https://docs.pytorch.org/docs/stable/generated/torch.cuda.CUDAGraph.html

  23. [23]

    Torchscript — pytorch documentation,

    PyTorch Contributors, “Torchscript — pytorch documentation,” 2025, accessed: 2025-10-16. [Online]. Available: https://docs.pytorch.org/docs/main/jit.html

  24. [24]

    PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

    J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. K. Luk, B. Maher, Y. Pan, C. Puhrsch, M....

  25. [25]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017

  26. [26]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, L. Martin, K. Stone, et al., “Llama: Open and efficient foundation language models,” 2023, arXiv preprint arXiv:2302.13971

  27. [27]

    Mistral 7B

    A. Q. Jiang et al., “Mistral 7b,” 2023, arXiv preprint arXiv:2310.06825. [Online]. Available: https://arxiv.org/abs/2310.06825

  28. [28]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    M. Shoeybi, M. Patwary, et al., “Megatron-lm: Training multi-billion parameter language models using model parallelism,” 2019, arXiv preprint arXiv:1909.08053

  29. [29]

    Tvm: An automated end-to-end optimizing compiler for deep learning,

    T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: An automated end-to-end optimizing compiler for deep learning,” in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI’18). Carlsbad, CA, USA: USENIX Association, 2018, pp. 579–594

  30. [30]

    Constant time launch for straight-line cuda graphs and other performance enhancements,

    H. Hoffman and F. Oh, “Constant time launch for straight-line cuda graphs and other performance enhancements,” NVIDIA Developer Blog, 2024

  31. [31]

    Pygraph: Robust compiler support for cuda graphs in pytorch

    A. Ghosh, A. Nayak, A. Panwa, and A. Basu, “Pygraph: Robust compiler support for cuda graphs in pytorch,” 2025, arXiv preprint arXiv:2503.19779. [Online]. Available: https://arxiv.org/abs/2503.19779

  32. [32]

    Multi-gpu greedy scheduling through a polyglot runtime,

    I. D. D. Lavore, G. W. D. Donato, A. Parravicini, F. Sgherzi, D. Bonetta, and M. D. Santambrogio, “Multi-gpu greedy scheduling through a polyglot runtime,” in Proceedings of the 22nd ACM International Conference on Computing Frontiers (CF ’25), 2025, pp. 185–194. [Online]. Available: https://doi.org/10.1145/3719276.3725199

  33. [33]

    Xla - tensorflow, compiled,

    XLA Team, “Xla - tensorflow, compiled,” Google Developers Blog, 2017. [Online]. Available: https://developers.googleblog.com/en/xla-tensorflow-compiled/

  34. [34]

    Dynpipe: Toward dynamic end-to-end pipeline parallelism for interference-aware dnn training,

    Z. Yuan, X. Wang, Y. Nie, Y. Tao, Y. Li, Z. Shao, X. Liao, B. Li, and H. Jin, “Dynpipe: Toward dynamic end-to-end pipeline parallelism for interference-aware dnn training,” IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 11, pp. 2366–2382, 2025

  35. [35]

    Alpa: Automating inter- and intra-operator parallelism for distributed deep learning

    L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing, J. E. Gonzalez, and I. Stoica, “Alpa: Automating inter- and intra-operator parallelism for distributed deep learning,” 2022, arXiv preprint arXiv:2201.12023. [Online]. Available: https://arxiv.org/abs/2201.12023

  36. [36]

    Boosting performance of iterative applications on gpus: Kernel batching with cuda graphs,

    J. Ekelund, S. Markidis, and I. Peng, “Boosting performance of iterative applications on gpus: Kernel batching with cuda graphs,” 2025, arXiv preprint arXiv:2501.09398, Accepted to PDP 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.09398

  37. [37]

    cublas library documentation,

    NVIDIA Corporation, “cublas library documentation,” 2025. [Online]. Available: https://docs.nvidia.com/cuda/cublas/

  38. [38]

    Optimization techniques for gpu programming,

    P. Hijma, S. Heldens, A. Sclocco, B. van Werkhoven, and H. E. Bal, “Optimization techniques for gpu programming,” ACM Computing Surveys, vol. 55, no. 11, pp. 1–81, 2023