
arXiv: 2509.16248 · v3 · submitted 2025-09-17 · cs.PL · cs.LG · cs.SE

GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2

Pith reviewed 2026-05-18 16:00 UTC · model grok-4.3

classification: cs.PL · cs.LG · cs.SE
keywords: Graph breaks · PyTorch 2 · TorchDynamo · Code transformation · Dynamic control flow · FX graph · Performance optimization · Hugging Face models

The pith

GraphMend applies two source code transformations to remove FX graph breaks in PyTorch 2 caused by dynamic control flow and Python side effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GRAPHMEND as a pre-execution compiler pass that rewrites PyTorch programs so TorchDynamo can capture them as single large FX graphs. It targets the two main sources of breaks: conditional statements whose outcomes depend on runtime values and Python operations that produce side effects outside the graph. By handling these patterns automatically through the Jaseci framework, the approach avoids repeated fallbacks to eager execution and the associated CPU-GPU synchronizations. A sympathetic reader would care because the result is faster execution and fewer manual code changes for developers using PyTorch 2.
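To make the side-effect case concrete, here is a minimal sketch of the reordering idea named in the Figure 4 caption, written in plain Python so it runs without PyTorch. Floats stand in for tensors, and `forward_original`, `forward_deferred`, and `run_and_capture` are hypothetical names, not code from the paper. The working assumption is that a `print` in the middle of a traced function is what splits the graph, so the value is captured at the original program point and printed once the computation has finished.

```python
import contextlib
import io


def forward_original(x):
    h = x * 2.0
    print("hidden:", h)        # side effect in the middle of the computation
    return h + 1.0


def forward_deferred(x):
    h = x * 2.0
    logged = h                 # capture the value at the original program point
    out = h + 1.0              # the computation now runs without interruption
    print("hidden:", logged)   # side effect replayed after the computation
    return out


def run_and_capture(fn, x):
    """Run fn(x) and return (result, everything it wrote to stdout)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = fn(x)
    return result, buf.getvalue()


if __name__ == "__main__":
    print(run_and_capture(forward_original, 3.0))
    print(run_and_capture(forward_deferred, 3.0))
```

The deferred version returns the same value and emits the same text, which is the observable-equivalence property a rewrite of this kind would need. With real tensors, a value that is mutated in place after the capture point would have to be copied first; that case is not handled here.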

Core claim

GRAPHMEND introduces two code transformations that eliminate graph breaks due to dynamic control flow and Python side effects, allowing PyTorch's JIT pipeline to produce larger uninterrupted FX graphs across eight Hugging Face models.

What carries the argument

Two code transformations on the Jaseci compilation framework that detect and rewrite dynamic control flow and side-effect patterns before TorchDynamo runs.
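As an illustration of what a source-level rewrite for dynamic control flow might look like, the sketch below uses Python's standard `ast` module rather than Jaseci, so it should not be read as GraphMend's implementation. It handles only the narrowest pattern: an `if`/`else` whose two arms each assign one expression to the same name, rewritten into a single `torch.where` call as in the Figure 2 caption. `IfToWhere`, `rewrite`, and the sample `forward` source are hypothetical. The rewriter only produces source text, so PyTorch itself is not imported.

```python
import ast


class IfToWhere(ast.NodeTransformer):
    """Rewrite `if c: t = a` / `else: t = b` into `t = torch.where(c, a, b)`."""

    def visit_If(self, node):
        self.generic_visit(node)  # handle nested statements first
        if len(node.body) != 1 or len(node.orelse) != 1:
            return node
        then, other = node.body[0], node.orelse[0]
        if not (isinstance(then, ast.Assign) and isinstance(other, ast.Assign)):
            return node  # e.g. a print() in a branch: leave the code alone
        if len(then.targets) != 1 or len(other.targets) != 1:
            return node
        t1, t2 = then.targets[0], other.targets[0]
        if not (isinstance(t1, ast.Name) and isinstance(t2, ast.Name)):
            return node
        if t1.id != t2.id:
            return node
        call = ast.Call(
            func=ast.Attribute(
                value=ast.Name(id="torch", ctx=ast.Load()),
                attr="where",
                ctx=ast.Load(),
            ),
            args=[node.test, then.value, other.value],
            keywords=[],
        )
        new_assign = ast.Assign(
            targets=[ast.Name(id=t1.id, ctx=ast.Store())], value=call
        )
        return ast.copy_location(new_assign, node)


def rewrite(source):
    """Return the source text with matching if/else blocks rewritten."""
    tree = IfToWhere().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


SOURCE = """
def forward(x):
    if x.sum() > 0:
        y = x * 2
    else:
        y = x - 1
    return y
"""

UNSUPPORTED = """
def forward(x):
    if x.sum() > 0:
        print(x)
    else:
        y = x
    return x
"""

if __name__ == "__main__":
    print(rewrite(SOURCE))
```

`torch.where` evaluates both arms, so this rewrite is only safe when both expressions are side-effect free, valid for every input, and produce tensors of broadcast-compatible shapes; a real pass would need to check those conditions. Branches that contain anything else, such as the `print` in `UNSUPPORTED`, are left untouched by this sketch.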

Load-bearing premise

The two transformations preserve the original program semantics for the evaluated models without creating new breaks or incorrect results.

What would settle it

A model where the transformed code produces different numerical outputs or more graph breaks than the original on identical inputs.

Figures

Figures reproduced from arXiv: 2509.16248 by Jason Mars, Jayanaka Dantanarayana, Krisztian Flautner, Lingjia Tang, Savini Kashmira, Thamirawaran Sathiyalogeswaran.

Figure 1: A simple forward function (a) and its compiled FX …
Figure 2: Comparison of control flow handling in torch.compile: (a) graph break due to Python control flow, and (b) fixed version using torch.where.
Figure 3: Profiled traces of forward pass execution across CPU and GPU. (a) Forward pass execution trace of code with graph …
Figure 4: Fixing graph breaks due to Python I/O: (a) direct print causes a graph break, (b) reordering via variable assignment
Figure 5: Bytecode-level limitations of TorchDynamo. (a) and …
Figure 6: GraphMend compiler integration in the Jac pipeline. The pipeline (top) lowers Python/Jac source code into a unified …
Figure 7: Latency improvements from GRAPHMEND: (a) cold-start latency reductions, (b) steady-state latency reductions across benchmark models on RTX 3090 and A40 GPUs.

Table II: Graph break counts in the original model and fix rates achieved by applying GRAPHMEND across the benchmark suite.

Benchmark Model            Graph Breaks   Fixed (%)
biogpt                     2              100
blenderbot-400M-distill    3              100
flan-t5-large              3              100
longformer-base-4096       5              40
…

Figure 8: Relative throughput improvement across benchmark …
Figure 9: (a) Profile traces. eliminates the overhead of CPU–GPU context switching during steady-state runs. b) Cold run overhead analysis: Cold runs introduce even larger overheads. In the profiler trace of original model ( …
Figure 10: Profiler trace of cold run of Qwen-Audio-Chat model run on A40 GPU.
Figure 11: CPU/GPU activity time traces for Qwen-Audio-Chat model run on A40 GPU.
Figure 12: Kernel Fusion and Reordering Visualization for Phi …
Original abstract

This paper presents GRAPHMEND, a high-level compiler technique that eliminates FX graph breaks in PyTorch 2 programs. Although PyTorch 2 introduced TorchDynamo and TorchInductor to enable just-in-time graph compilation, unresolved dynamic control flow and unsupported Python constructs often fragment models into multiple FX graphs. These fragments force frequent fallbacks to eager mode, introduce costly CPU-to-GPU synchronizations, and reduce optimization opportunities. GRAPHMEND addresses this limitation by analyzing and transforming source code before execution. Built on the Jaseci compilation framework, GRAPHMEND introduces two code transformations that remove graph breaks due to dynamic control flow and Python side effects. This design allows PyTorch's compilation pipeline to capture larger, uninterrupted FX graphs without requiring manual refactoring by developers. Evaluation across eight Hugging Face models shows that GRAPHMEND removes graph breaks due to dynamic control flow and Python side effects, reducing the break count to 0 in 6 models and reducing it from 5 to 2 in another model. On NVIDIA RTX 3090 and A40 GPUs, GRAPHMEND achieves up to 75% latency reductions and up to 8% higher end-to-end throughput. These results demonstrate that high-level code transformation is an effective complement to PyTorch's dynamic JIT compilation pipeline, substantially improving both usability and performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents GraphMend, a high-level compiler technique built on the Jaseci framework that applies two source-level code transformations to PyTorch 2 programs. These transformations target graph breaks arising from dynamic control flow and unsupported Python side effects, enabling TorchDynamo and TorchInductor to produce larger uninterrupted FX graphs. Evaluation on eight Hugging Face models reports that GraphMend reduces break counts to zero in six models and from five to two in one model, yielding up to 75% latency reductions and up to 8% higher end-to-end throughput on NVIDIA RTX 3090 and A40 GPUs.

Significance. If the transformations are semantics-preserving, the work would be significant for the programming languages and compilers community. It shows that automated high-level source transformations can serve as an effective complement to dynamic JIT compilation pipelines in machine-learning frameworks, reducing the need for manual refactoring while delivering measurable performance gains on realistic models.

major comments (2)
  1. [Sections describing the code transformations] The description of the two Jaseci-based transformations (for dynamic control flow and Python side effects) supplies neither the concrete rewrite rules nor any argument for semantic preservation such as bisimulation, observational equivalence, or checked invariants. This is load-bearing for the central claim that the reported latency and throughput improvements arise from valid larger FX graphs rather than from altered program behavior.
  2. [Evaluation section] The evaluation reports break-count reductions and performance numbers across eight models but contains no details on how semantic equivalence of the transformed programs was verified (e.g., via differential testing, unit-test suites, or invariant checks) or whether the results hold for different data or model variants.
minor comments (1)
  1. [Abstract] The abstract states 'up to 8% higher end-to-end throughput' without specifying the exact baseline or measurement conditions.
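The minor comment turns on what the percentages are measured against. Below is a small sketch of the usual definitions, assuming both metrics are taken relative to the unmodified `torch.compile` baseline (this page does not confirm that). The function names and the sample numbers are illustrative only and are not taken from the paper.

```python
def latency_reduction_pct(baseline_ms, optimized_ms):
    """Percentage of baseline latency removed by the optimization."""
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def throughput_gain_pct(baseline_tps, optimized_tps):
    """Percentage increase in throughput over the baseline."""
    return 100.0 * (optimized_tps - baseline_tps) / baseline_tps


if __name__ == "__main__":
    # Made-up measurements, chosen only to show the arithmetic.
    print(latency_reduction_pct(200.0, 50.0))  # 75% reduction
    print(speedup(200.0, 50.0))                # the same change as a 4x speedup
    print(throughput_gain_pct(100.0, 108.0))   # 8% gain
```

A latency reduction and a throughput gain are not interchangeable figures: under these definitions a 75% latency reduction on one forward pass corresponds to a 4x speedup of that pass, while an end-to-end throughput number also includes whatever work sits outside the compiled region. That is why the baseline and measurement window matter for reading the two headline numbers together.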

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Sections describing the code transformations] The description of the two Jaseci-based transformations (for dynamic control flow and Python side effects) supplies neither the concrete rewrite rules nor any argument for semantic preservation such as bisimulation, observational equivalence, or checked invariants. This is load-bearing for the central claim that the reported latency and throughput improvements arise from valid larger FX graphs rather than from altered program behavior.

    Authors: We agree that explicit rewrite rules and a semantic-preservation argument are necessary to support the central claim. In the revised manuscript we will add the concrete source-level rewrite rules applied by the two Jaseci transformations, together with an observational-equivalence argument showing that the transformed programs produce identical observable outputs and side effects to the originals under PyTorch execution. This addition will make clear that the measured latency and throughput gains result from larger uninterrupted FX graphs.

    revision: yes

  2. Referee: [Evaluation section] The evaluation reports break-count reductions and performance numbers across eight models but contains no details on how semantic equivalence of the transformed programs was verified (e.g., via differential testing, unit-test suites, or invariant checks) or whether the results hold for different data or model variants.

    Authors: We acknowledge that the current evaluation section lacks explicit verification details. In the revised version we will add a dedicated paragraph describing our differential-testing procedure: both original and transformed models were executed on the same Hugging Face evaluation inputs, with outputs compared for numerical equivalence within floating-point tolerance. We will also note that the transformations are data- and variant-agnostic and report that the same break-count reductions were observed across the eight models; additional variant testing can be expanded if space allows.

    revision: yes
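The differential-testing procedure described in the rebuttal can be sketched in a few lines. This is a plain-Python stand-in under stated assumptions: floats replace model outputs, `math.isclose` replaces a tensor comparison (with PyTorch one would more likely reach for something like `torch.testing.assert_close`), and `outputs_match`, `original`, `transformed`, and `buggy` are hypothetical names. It is not the authors' harness.

```python
import math


def outputs_match(original, transformed, inputs, rel_tol=1e-6, abs_tol=1e-9):
    """Return the list of (input, original_out, transformed_out) mismatches."""
    mismatches = []
    for x in inputs:
        a, b = original(x), transformed(x)
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((x, a, b))
    return mismatches


def original(x):
    # Data-dependent branch, the kind of pattern that causes a graph break.
    if x > 0:
        return x * 2.0
    return x - 1.0


def transformed(x):
    # Branch-free form: compute both arms, then select with a 0/1 mask.
    mask = float(x > 0)
    return mask * (x * 2.0) + (1.0 - mask) * (x - 1.0)


def buggy(x):
    # Same idea with an off-by-one condition (>= instead of >).
    mask = float(x >= 0)
    return mask * (x * 2.0) + (1.0 - mask) * (x - 1.0)


INPUTS = [-2.0, -0.5, 0.0, 0.5, 3.0]

if __name__ == "__main__":
    print(outputs_match(original, transformed, INPUTS))
    print(outputs_match(original, buggy, INPUTS))
```

The `buggy` variant differs from the original only at the boundary input `0.0`, which shows why the choice of test inputs matters: a harness that never exercises the branch condition's edge would report the two programs as equivalent.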

Circularity Check

0 steps flagged

No circularity: empirical results on concrete models with no fitted parameters or self-referential derivations

Full rationale

The paper describes a source-level transformation technique (GraphMend) built on Jaseci and evaluates it by counting graph breaks and measuring latency/throughput on eight Hugging Face models. All reported numbers are direct experimental measurements rather than quantities derived from equations, fitted parameters, or self-citations that reduce to the target claim. No derivation chain, uniqueness theorem, or ansatz is invoked that collapses to the inputs by construction. The central performance claims rest on observed behavior of the implemented rewrites, not on any internal fitting or definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that source transformations can be performed without semantic change and that the Jaseci framework correctly implements the required rewrites for the target PyTorch constructs.

axioms (2)
  • [domain assumption] The two code transformations preserve original program semantics for the evaluated models.
    Required for the claim that larger graphs can be safely captured without altering model behavior.
  • [domain assumption] Jaseci can reliably detect and rewrite the relevant dynamic-control-flow and side-effect patterns.
    Invoked when the paper states that the transformations are built on the Jaseci compilation framework.



