GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2
Pith reviewed 2026-05-18 16:00 UTC · model grok-4.3
The pith
GraphMend applies two source code transformations to remove FX graph breaks in PyTorch 2 caused by dynamic control flow and Python side effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GraphMend introduces two code transformations that eliminate graph breaks caused by dynamic control flow and Python side effects, allowing PyTorch's JIT pipeline to produce larger uninterrupted FX graphs across eight Hugging Face models.
What carries the argument
Two code transformations, built on the Jaseci compilation framework, that detect and rewrite dynamic control flow and side-effect patterns before TorchDynamo runs.
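A minimal sketch of what a predication-style rewrite of dynamic control flow could look like, assuming the rewrite replaces a data-dependent `if` with `torch.where`. The function names `branchy` and `predicated` are hypothetical and not taken from the paper, whose concrete rewrite rules are not reproduced on this page. Such a rewrite is only valid when both branches are free of side effects and yield tensors of compatible shape, because both are evaluated.

```python
import torch


def branchy(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent Python `if`: which branch runs depends on a tensor value.
    # This is the kind of dynamic control flow the review says causes graph breaks.
    if x.sum() > 0:
        return x * 2
    return x - 1


def predicated(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical rewritten form: compute both branches, then select
    # elementwise with a tensor-level predicate instead of a Python `if`.
    cond = x.sum() > 0  # 0-dim boolean tensor, broadcast by torch.where
    return torch.where(cond, x * 2, x - 1)


pos = torch.tensor([1.0, 2.0, 3.0])
neg = torch.tensor([-1.0, -2.0, -3.0])

# Both versions should agree on inputs that exercise each branch.
same_on_pos = torch.equal(branchy(pos), predicated(pos))
same_on_neg = torch.equal(branchy(neg), predicated(neg))
```

Both branches run on every call, so this form trades some extra arithmetic for code that contains no Python-level branch on a tensor value.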
Load-bearing premise
The two transformations preserve the original program semantics for the evaluated models without creating new breaks or incorrect results.
What would settle it
A model where the transformed code produces different numerical outputs or more graph breaks than the original on identical inputs.
Original abstract
This paper presents GRAPHMEND, a high-level compiler technique that eliminates FX graph breaks in PyTorch 2 programs. Although PyTorch 2 introduced TorchDynamo and TorchInductor to enable just-in-time graph compilation, unresolved dynamic control flow and unsupported Python constructs often fragment models into multiple FX graphs. These fragments force frequent fallbacks to eager mode, introduce costly CPU-to-GPU synchronizations, and reduce optimization opportunities. GRAPHMEND addresses this limitation by analyzing and transforming source code before execution. Built on the Jaseci compilation framework, GRAPHMEND introduces two code transformations that remove graph breaks due to dynamic control flow and Python side effects. This design allows PyTorch's compilation pipeline to capture larger, uninterrupted FX graphs without requiring manual refactoring by developers. Evaluation across eight Hugging Face models shows that GRAPHMEND removes graph breaks due to dynamic control flow and Python side effects, reducing the break count to 0 in 6 models and reducing it from 5 to 2 in another model. On NVIDIA RTX 3090 and A40 GPUs, GRAPHMEND achieves up to 75% latency reductions and up to 8% higher end-to-end throughput. These results demonstrate that high-level code transformation is an effective complement to PyTorch's dynamic JIT compilation pipeline, substantially improving both usability and performance.
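As a rough illustration of the side-effect case, the sketch below moves a logging call from the middle of a computation to its end, which is one way a side effect could be kept out of the captured region. It uses plain Python lists as a stand-in for tensors so that it runs without PyTorch. The names `forward_original`, `forward_deferred`, and the `log` parameter are hypothetical and are not GraphMend's API.

```python
from typing import Callable, List


def forward_original(xs: List[float], log: Callable[[str], None] = print) -> List[float]:
    hidden = [2.0 * x for x in xs]
    # Side effect in the middle of the computation.
    log(f"hidden sum = {sum(hidden)}")
    return [h + 1.0 for h in hidden]


def forward_deferred(xs: List[float], log: Callable[[str], None] = print) -> List[float]:
    hidden = [2.0 * x for x in xs]
    # Capture the value that would have been logged, but do not log yet.
    deferred_sum = sum(hidden)
    out = [h + 1.0 for h in hidden]
    # Epilogue: emit the side effect only after the computation has finished.
    log(f"hidden sum = {deferred_sum}")
    return out


# Collect log messages in lists so the two versions can be compared.
logs_original: List[str] = []
logs_deferred: List[str] = []
out_original = forward_original([1.0, 2.0], log=logs_original.append)
out_deferred = forward_deferred([1.0, 2.0], log=logs_deferred.append)
```

The deferred version is only equivalent when the logged value is captured at the original program point and nothing else observes the ordering of the side effect relative to the computation.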
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GraphMend, a high-level compiler technique built on the Jaseci framework that applies two source-level code transformations to PyTorch 2 programs. These transformations target graph breaks arising from dynamic control flow and unsupported Python side effects, enabling TorchDynamo and TorchInductor to produce larger uninterrupted FX graphs. Evaluation on eight Hugging Face models reports that GraphMend reduces break counts to zero in six models and from five to two in one model, yielding up to 75% latency reductions and up to 8% higher end-to-end throughput on NVIDIA RTX 3090 and A40 GPUs.
Significance. If the transformations are semantics-preserving, the work would be significant for the programming languages and compilers community. It shows that automated high-level source transformations can serve as an effective complement to dynamic JIT compilation pipelines in machine-learning frameworks, reducing the need for manual refactoring while delivering measurable performance gains on realistic models.
major comments (2)
- [Sections describing the code transformations] The description of the two Jaseci-based transformations (for dynamic control flow and Python side effects) supplies neither the concrete rewrite rules nor any argument for semantic preservation such as bisimulation, observational equivalence, or checked invariants. This is load-bearing for the central claim that the reported latency and throughput improvements arise from valid larger FX graphs rather than from altered program behavior.
- [Evaluation section] The evaluation reports break-count reductions and performance numbers across eight models but contains no details on how semantic equivalence of the transformed programs was verified (e.g., via differential testing, unit-test suites, or invariant checks) or whether the results hold for different data or model variants.
minor comments (1)
- [Abstract] The abstract states 'up to 8% higher end-to-end throughput' without specifying the exact baseline or measurement conditions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
- Referee: [Sections describing the code transformations] The description of the two Jaseci-based transformations (for dynamic control flow and Python side effects) supplies neither the concrete rewrite rules nor any argument for semantic preservation such as bisimulation, observational equivalence, or checked invariants. This is load-bearing for the central claim that the reported latency and throughput improvements arise from valid larger FX graphs rather than from altered program behavior.
  Authors: We agree that explicit rewrite rules and a semantic-preservation argument are necessary to support the central claim. In the revised manuscript we will add the concrete source-level rewrite rules applied by the two Jaseci transformations, together with an observational-equivalence argument showing that the transformed programs produce identical observable outputs and side effects to the originals under PyTorch execution. This addition will make clear that the measured latency and throughput gains result from larger uninterrupted FX graphs.
  revision: yes
- Referee: [Evaluation section] The evaluation reports break-count reductions and performance numbers across eight models but contains no details on how semantic equivalence of the transformed programs was verified (e.g., via differential testing, unit-test suites, or invariant checks) or whether the results hold for different data or model variants.
  Authors: We acknowledge that the current evaluation section lacks explicit verification details. In the revised version we will add a dedicated paragraph describing our differential-testing procedure: both original and transformed models were executed on the same Hugging Face evaluation inputs, with outputs compared for numerical equivalence within floating-point tolerance. We will also note that the transformations are data- and variant-agnostic and report that the same break-count reductions were observed across the eight models; additional variant testing can be expanded if space allows.
  revision: yes
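The differential-testing procedure described in the rebuttal can be sketched as follows. This is an illustrative harness under stated assumptions, not the authors' code: the helper names, the tolerances, and the toy `original`, `transformed`, and `buggy` functions are all hypothetical, and plain lists of floats stand in for model outputs. With real tensors, a call such as `torch.allclose` would play the role of `outputs_match`.

```python
import math
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]


def outputs_match(a: Sequence[float], b: Sequence[float],
                  rel_tol: float = 1e-5, abs_tol: float = 1e-8) -> bool:
    # Elementwise comparison within a floating-point tolerance.
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


def differential_test(original: Model, transformed: Model,
                      inputs: Sequence[Sequence[float]]) -> List[int]:
    # Return the indices of inputs on which the two versions disagree.
    return [i for i, x in enumerate(inputs)
            if not outputs_match(original(x), transformed(x))]


def original(xs: Sequence[float]) -> List[float]:
    if sum(xs) > 0:
        return [2.0 * x for x in xs]
    return [x - 1.0 for x in xs]


def transformed(xs: Sequence[float]) -> List[float]:
    # Predicated form: compute both branches, then select.
    take_first = sum(xs) > 0
    doubled = [2.0 * x for x in xs]
    shifted = [x - 1.0 for x in xs]
    return [d if take_first else s for d, s in zip(doubled, shifted)]


def buggy(xs: Sequence[float]) -> List[float]:
    # Deliberately wrong rewrite that always takes the first branch.
    return [2.0 * x for x in xs]


inputs = [[1.0, 2.0], [-1.0, -2.0], [0.5, -0.25]]
mismatches_ok = differential_test(original, transformed, inputs)
mismatches_bad = differential_test(original, buggy, inputs)
```

An empty mismatch list is evidence of equivalence only on the inputs tried, so the input set needs to exercise every branch that the transformation touches.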
Circularity Check
No circularity: empirical results on concrete models with no fitted parameters or self-referential derivations
The paper describes a source-level transformation technique (GraphMend) built on Jaseci and evaluates it by counting graph breaks and measuring latency/throughput on eight Hugging Face models. All reported numbers are direct experimental measurements rather than quantities derived from equations, fitted parameters, or self-citations that reduce to the target claim. No derivation chain, uniqueness theorem, or ansatz is invoked that collapses to the inputs by construction. The central performance claims rest on observed behavior of the implemented rewrites, not on any internal fitting or definitional loop.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The two code transformations preserve original program semantics for the evaluated models.
- domain assumption: Jaseci can reliably detect and rewrite the relevant dynamic-control-flow and side-effect patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: GRAPHMEND introduces two code transformations that remove graph breaks due to dynamic control flow and Python I/O functions... Predicated Dynamic Control Flow Transformation... Graph-Epilogue Deferred Side Effect Transformation
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Jac compiler constructs AST + CFG + symbol table (UniiR); GRAPHMEND pass tags IfStmt and print/logger calls then rewrites with torch.where or deferred assignment
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.