PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch · 2025 · arXiv 2503.19779

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

cs.DC · 2026-05-20 · unverdicted · novelty 6.0

NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.

GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2

cs.PL · 2025-09-17 · conditional · novelty 6.0

GraphMend uses two Jaseci-based code transformations to eliminate dynamic-control-flow and side-effect graph breaks in PyTorch 2, reducing breaks to zero in six of eight Hugging Face models and yielding up to 75% latency reduction on RTX 3090 and A40 GPUs.

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

A hybrid JIT-CUDA Graph framework reduces TTFT by up to 66% and lowers P99 latency relative to TensorRT-LLM for single-GPU LLaMA-2 7B inference on short prompts.

citing papers explorer


NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding · cs.DC · 2026-05-20 · unverdicted · none · ref 20
GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2 · cs.PL · 2025-09-17 · conditional · none · ref 14
Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference · cs.LG · 2026-04-25 · unverdicted · none · ref 31
