GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2

Jason Mars; Jayanaka Dantanarayana; Krisztian Flautner; Lingjia Tang; Savini Kashmira; Thamirawaran Sathiyalogeswaran

arXiv: 2509.16248 · v3 · submitted 2025-09-17 · 💻 cs.PL · cs.LG · cs.SE

Savini Kashmira, Jayanaka Dantanarayana, Thamirawaran Sathiyalogeswaran, Krisztian Flautner, Lingjia Tang, Jason Mars

Pith reviewed 2026-05-18 16:00 UTC · model grok-4.3

classification 💻 cs.PL · cs.LG · cs.SE

keywords Graph breaks · PyTorch 2 · TorchDynamo · Code transformation · Dynamic control flow · FX graph · Performance optimization · Hugging Face models

The pith

GraphMend applies two source code transformations to remove FX graph breaks in PyTorch 2 caused by dynamic control flow and Python side effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GRAPHMEND as a pre-execution compiler pass that rewrites PyTorch programs so TorchDynamo can capture them as single large FX graphs. It targets the two main sources of breaks: conditional statements whose outcomes depend on runtime values and Python operations that produce side effects outside the graph. By handling these patterns automatically through the Jaseci framework, the approach avoids repeated fallbacks to eager execution and the associated CPU-GPU synchronizations. A sympathetic reader would care because the result is faster execution and fewer manual code changes for developers using PyTorch 2.
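
To make the first pattern concrete, here is a minimal sketch of a predication-style rewrite written with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). It is not the Jac/Jaseci pass the paper describes: the names `PredicateIf` and `predicate_source` are hypothetical, and the sketch only handles the simplest shape, an `if`/`else` in which each branch is a single assignment to the same variable. The emitted code calls `torch.where`, which evaluates both branches, so a rewrite like this is only safe when both branches are free of side effects and the condition is a tensor.

```python
import ast


class PredicateIf(ast.NodeTransformer):
    """Toy rewrite: `if c: v = a` / `else: v = b`  ->  `v = torch.where(c, a, b)`."""

    def visit_If(self, node):
        self.generic_visit(node)  # rewrite nested ifs first
        # Only handle one statement per branch.
        if len(node.body) != 1 or len(node.orelse) != 1:
            return node
        then_stmt, else_stmt = node.body[0], node.orelse[0]
        if not (isinstance(then_stmt, ast.Assign) and isinstance(else_stmt, ast.Assign)):
            return node
        if len(then_stmt.targets) != 1 or len(else_stmt.targets) != 1:
            return node
        then_target, else_target = then_stmt.targets[0], else_stmt.targets[0]
        same_name = (
            isinstance(then_target, ast.Name)
            and isinstance(else_target, ast.Name)
            and then_target.id == else_target.id
        )
        if not same_name:
            return node
        # Build `torch.where(<test>, <then value>, <else value>)`.
        where_call = ast.Call(
            func=ast.Attribute(
                value=ast.Name(id="torch", ctx=ast.Load()),
                attr="where",
                ctx=ast.Load(),
            ),
            args=[node.test, then_stmt.value, else_stmt.value],
            keywords=[],
        )
        new_assign = ast.Assign(
            targets=[ast.Name(id=then_target.id, ctx=ast.Store())],
            value=where_call,
        )
        return ast.copy_location(new_assign, node)


def predicate_source(src: str) -> str:
    """Parse source text, apply the toy rewrite, and return new source text."""
    tree = PredicateIf().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = """
def f(x):
    if x.sum() > 0:
        y = x + 1
    else:
        y = x - 1
    return y
"""
rewritten = predicate_source(original)
print(rewritten)
```

The sketch only transforms source text and never imports PyTorch. Branches with more than one statement, or with assignments to different variables, are left untouched, which mirrors the point that only some breaks are fixable by predication.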

Core claim

GRAPHMEND introduces two code transformations that eliminate graph breaks due to dynamic control flow and Python side effects, allowing PyTorch's JIT pipeline to produce larger uninterrupted FX graphs across eight Hugging Face models.

What carries the argument

Two code transformations, built on the Jaseci compilation framework, that detect and rewrite dynamic control flow and side-effect patterns before TorchDynamo runs.
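
The side-effect half of this machinery can be sketched in the same spirit. The example below is a hedged illustration, not the paper's implementation: `defer_prints` is a hypothetical helper that uses Python's standard `ast` module to capture the arguments of each simple `print(...)` statement at its original position and replay the call just before the function's final `return`. It assumes a straight-line function body ending in a single `return`, ignores `print` calls with keyword arguments, and assumes later statements neither produce their own output nor mutate the captured values in place.

```python
import ast
import contextlib
import io


def defer_prints(src: str) -> str:
    """Toy rewrite: move simple print(...) statements to just before the final return."""
    tree = ast.parse(src)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    for fn in functions:
        # Toy restriction: the body must end in a single return statement.
        if not isinstance(fn.body[-1], ast.Return):
            continue
        kept, epilogue = [], []
        for stmt in fn.body[:-1]:
            is_simple_print = (
                isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Call)
                and isinstance(stmt.value.func, ast.Name)
                and stmt.value.func.id == "print"
                and not stmt.value.keywords
            )
            if not is_simple_print:
                kept.append(stmt)
                continue
            tmp = f"_deferred_{len(epilogue)}"
            # Capture the arguments where the print originally stood.
            kept.append(ast.Assign(
                targets=[ast.Name(id=tmp, ctx=ast.Store())],
                value=ast.Tuple(elts=list(stmt.value.args), ctx=ast.Load()),
            ))
            # Replay the call as print(*tmp) in the epilogue.
            epilogue.append(ast.Expr(value=ast.Call(
                func=ast.Name(id="print", ctx=ast.Load()),
                args=[ast.Starred(value=ast.Name(id=tmp, ctx=ast.Load()), ctx=ast.Load())],
                keywords=[],
            )))
        fn.body = kept + epilogue + [fn.body[-1]]
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = """
def step(x):
    y = x * 2
    print("doubled", y)
    z = y + 1
    return z
"""
rewritten = defer_prints(original)

# Run the rewritten function and capture what it prints.
namespace = {}
exec(rewritten, namespace)
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = namespace["step"](3)
```

In this toy case the rewritten function returns the same value and prints the same text as the original; the only change is that the `print` now sits after the last computation, so a tracer would see the compute statements as one contiguous region.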

Load-bearing premise

The two transformations preserve the original program semantics for the evaluated models without creating new breaks or incorrect results.

What would settle it

A model where the transformed code produces different numerical outputs or more graph breaks than the original on identical inputs.

Figures

Figures reproduced from arXiv: 2509.16248 by Jason Mars, Jayanaka Dantanarayana, Krisztian Flautner, Lingjia Tang, Savini Kashmira, Thamirawaran Sathiyalogeswaran.

**Figure 2.** Comparison of control flow handling in torch.compile: (a) graph break due to Python control flow, and (b) fixed version using torch.where.

**Figure 3.** Profiled traces of forward pass execution across CPU and GPU. (a) Forward pass execution trace of code with graph …

**Figure 4.** Fixing graph breaks due to Python I/O: (a) direct print causes a graph break, (b) reordering via variable assignment …

**Figure 5.** Bytecode-level limitations of TorchDynamo. (a) and …

**Figure 6.** GraphMend compiler integration in the Jac pipeline. The pipeline (top) lowers Python/Jac source code into a unified …

**Figure 7.** Latency improvements from GRAPHMEND: (a) cold-start latency reductions, (b) steady-state latency reductions across benchmark models on RTX 3090 and A40 GPUs.

**Table II.** Graph break counts in the original model and fix rates achieved by applying GRAPHMEND across the benchmark suite.

| Benchmark Model | Graph Breaks | Fixed (%) |
| --- | --- | --- |
| biogpt | 2 | 100 |
| blenderbot-400M-distill | 3 | 100 |
| flan-t5-large | 3 | 100 |
| longformer-base-4096 | 5 | 40 |
| … | … | … |

**Figure 8.** Relative throughput improvement across benchmark …

**Figure 9.** (a) Profile traces: … eliminates the overhead of CPU–GPU context switching during steady-state runs. (b) Cold run overhead analysis: cold runs introduce even larger overheads. In the profiler trace of the original model ( …

**Figure 10.** Profiler trace of a cold run of the Qwen-Audio-Chat model on an A40 GPU.

**Figure 11.** CPU/GPU activity time traces for the Qwen-Audio-Chat model run on an A40 GPU.

**Figure 12.** Kernel fusion and reordering visualization for Phi …

Original abstract

This paper presents GRAPHMEND, a high-level compiler technique that eliminates FX graph breaks in PyTorch 2 programs. Although PyTorch 2 introduced TorchDynamo and TorchInductor to enable just-in-time graph compilation, unresolved dynamic control flow and unsupported Python constructs often fragment models into multiple FX graphs. These fragments force frequent fallbacks to eager mode, introduce costly CPU-to-GPU synchronizations, and reduce optimization opportunities. GRAPHMEND addresses this limitation by analyzing and transforming source code before execution. Built on the Jaseci compilation framework, GRAPHMEND introduces two code transformations that remove graph breaks due to dynamic control flow and Python side effects. This design allows PyTorch's compilation pipeline to capture larger, uninterrupted FX graphs without requiring manual refactoring by developers. Evaluation across eight Hugging Face models shows that GRAPHMEND removes graph breaks due to dynamic control flow and Python side effects, reducing the break count to 0 in 6 models and reducing it from 5 to 2 in another model. On NVIDIA RTX 3090 and A40 GPUs, GRAPHMEND achieves up to 75% latency reductions and up to 8% higher end-to-end throughput. These results demonstrate that high-level code transformation is an effective complement to PyTorch's dynamic JIT compilation pipeline, substantially improving both usability and performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GraphMend shows that two targeted source rewrites can eliminate most graph breaks in PyTorch 2 models and deliver measurable latency gains, but the lack of any semantic preservation argument leaves the speedups hard to trust.

GraphMend takes a practical route to bigger FX graphs in PyTorch 2 by rewriting the source before TorchDynamo sees it. The two transformations focus on dynamic control flow and Python side effects, and the Jaseci integration lets these changes happen automatically. On eight Hugging Face models the break count drops to zero in six cases and from five to two in one more, with latency cuts up to 75 percent on RTX 3090 and A40 hardware plus a small throughput lift. Those numbers are the clearest part of the work and show the idea can matter for real models that currently fall back to eager mode too often. The approach is straightforward and does not require users to rewrite their own code, which is a plus for adoption.

The soft spot is the missing check on correctness. The abstract gives no rewrite rules, no invariants, and no test that the transformed program produces the same results as the original. If a rewrite changes data-dependent behavior or introduces a new unsupported construct, the reported gains could be artifacts rather than valid fusion.

This paper is aimed at people who maintain or extend PyTorch compilation stacks and at engineers who hit graph breaks in production models. A reader looking for concrete pre-pass techniques will pick up usable details from the evaluation setup even if they want stronger evidence on equivalence. It deserves a serious referee because the problem is common and the empirical results are specific enough to review. Referees can ask for the exact rules and a verification method without dismissing the core contribution.

Referee Report

2 major / 1 minor

Summary. The paper presents GraphMend, a high-level compiler technique built on the Jaseci framework that applies two source-level code transformations to PyTorch 2 programs. These transformations target graph breaks arising from dynamic control flow and unsupported Python side effects, enabling TorchDynamo and TorchInductor to produce larger uninterrupted FX graphs. Evaluation on eight Hugging Face models reports that GraphMend reduces break counts to zero in six models and from five to two in one model, yielding up to 75% latency reductions and up to 8% higher end-to-end throughput on NVIDIA RTX 3090 and A40 GPUs.

Significance. If the transformations are semantics-preserving, the work would be significant for the programming languages and compilers community. It shows that automated high-level source transformations can serve as an effective complement to dynamic JIT compilation pipelines in machine-learning frameworks, reducing the need for manual refactoring while delivering measurable performance gains on realistic models.

major comments (2)

[Sections describing the code transformations] The description of the two Jaseci-based transformations (for dynamic control flow and Python side effects) supplies neither the concrete rewrite rules nor any argument for semantic preservation such as bisimulation, observational equivalence, or checked invariants. This is load-bearing for the central claim that the reported latency and throughput improvements arise from valid larger FX graphs rather than from altered program behavior.
[Evaluation section] The evaluation reports break-count reductions and performance numbers across eight models but contains no details on how semantic equivalence of the transformed programs was verified (e.g., via differential testing, unit-test suites, or invariant checks) or whether the results hold for different data or model variants.

minor comments (1)

[Abstract] The abstract states 'up to 8% higher end-to-end throughput' without specifying the exact baseline or measurement conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Referee: [Sections describing the code transformations] The description of the two Jaseci-based transformations (for dynamic control flow and Python side effects) supplies neither the concrete rewrite rules nor any argument for semantic preservation such as bisimulation, observational equivalence, or checked invariants. This is load-bearing for the central claim that the reported latency and throughput improvements arise from valid larger FX graphs rather than from altered program behavior.

Authors: We agree that explicit rewrite rules and a semantic-preservation argument are necessary to support the central claim. In the revised manuscript we will add the concrete source-level rewrite rules applied by the two Jaseci transformations, together with an observational-equivalence argument showing that the transformed programs produce identical observable outputs and side effects to the originals under PyTorch execution. This addition will make clear that the measured latency and throughput gains result from larger uninterrupted FX graphs. revision: yes
Referee: [Evaluation section] The evaluation reports break-count reductions and performance numbers across eight models but contains no details on how semantic equivalence of the transformed programs was verified (e.g., via differential testing, unit-test suites, or invariant checks) or whether the results hold for different data or model variants.

Authors: We acknowledge that the current evaluation section lacks explicit verification details. In the revised version we will add a dedicated paragraph describing our differential-testing procedure: both original and transformed models were executed on the same Hugging Face evaluation inputs, with outputs compared for numerical equivalence within floating-point tolerance. We will also note that the transformations are data- and variant-agnostic and report that the same break-count reductions were observed across the eight models; additional variant testing can be expanded if space allows. revision: yes
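
The differential-testing procedure the simulated rebuttal describes can be illustrated with a small, framework-free harness. This is a sketch under stated assumptions rather than the authors' procedure: `find_mismatches` and the three stand-in "models" are hypothetical, outputs are plain lists of floats, and tolerance is checked with `math.isclose`. In a PyTorch setting the comparison step would more likely use something like `torch.testing.assert_close` on tensors.

```python
import math


def find_mismatches(original, transformed, inputs, rel_tol=1e-5, abs_tol=1e-8):
    """Return the indices of inputs on which the two callables disagree."""
    mismatches = []
    for index, sample in enumerate(inputs):
        expected = original(sample)
        actual = transformed(sample)
        same = len(expected) == len(actual) and all(
            math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
            for e, a in zip(expected, actual)
        )
        if not same:
            mismatches.append(index)
    return mismatches


def original_model(xs):
    # Data-dependent Python branch, the kind that causes a graph break.
    if sum(xs) > 0:
        return [x + 1.0 for x in xs]
    return [x - 1.0 for x in xs]


def transformed_model(xs):
    # Branch-free version: compute both sides, select arithmetically.
    cond = float(sum(xs) > 0)
    return [cond * (x + 1.0) + (1.0 - cond) * (x - 1.0) for x in xs]


def buggy_transformed_model(xs):
    # Deliberately wrong rewrite: `>=` instead of `>` changes the zero-sum case.
    cond = float(sum(xs) >= 0)
    return [cond * (x + 1.0) + (1.0 - cond) * (x - 1.0) for x in xs]


inputs = [[1.0, 2.0], [-3.0, 0.5], [0.0, 0.0]]
good = find_mismatches(original_model, transformed_model, inputs)
bad = find_mismatches(original_model, buggy_transformed_model, inputs)
print(good, bad)
```

The buggy variant disagrees only on the boundary input whose sum is zero, which shows why the choice of test inputs matters: a suite that never exercises the branch boundary would report the incorrect rewrite as equivalent.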

Circularity Check

0 steps flagged

No circularity: empirical results on concrete models with no fitted parameters or self-referential derivations

The paper describes a source-level transformation technique (GraphMend) built on Jaseci and evaluates it by counting graph breaks and measuring latency/throughput on eight Hugging Face models. All reported numbers are direct experimental measurements rather than quantities derived from equations, fitted parameters, or self-citations that reduce to the target claim. No derivation chain, uniqueness theorem, or ansatz is invoked that collapses to the inputs by construction. The central performance claims rest on observed behavior of the implemented rewrites, not on any internal fitting or definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that source transformations can be performed without semantic change and that the Jaseci framework correctly implements the required rewrites for the target PyTorch constructs.

axioms (2)

domain assumption: The two code transformations preserve original program semantics for the evaluated models.
Required for the claim that larger graphs can be safely captured without altering model behavior.
domain assumption: Jaseci can reliably detect and rewrite the relevant dynamic-control-flow and side-effect patterns.
Invoked when the paper states that the transformations are built on the Jaseci compilation framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
Paper passage: GRAPHMEND introduces two code transformations that remove graph breaks due to dynamic control flow and Python I/O functions... Predicated Dynamic Control Flow Transformation... Graph-Epilogue Deferred Side Effect Transformation

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: Jac compiler constructs AST + CFG + symbol table (UniiR); GRAPHMEND pass tags IfStmt and print/logger calls then rewrites with torch.where or deferred assignment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Paszke, S

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K ¨opf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala,PyTorch: an imperative style, high- performance deep learning library. Red Hook, NY , USA: Curran Associates Inc., 2019

work page 2019

[2] [2]

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. V oznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y . Liang, J. Liang, Y . Lu, C. K. Luk, B. Maher, Y . Pan, C. Puhrsch, M....

work page doi:10.1145/3620665.3640366 2024

[3] [3]

Torch.fx: Practical program capture and transformation for deep learning in python,

J. K. Reed, Z. DeVito, H. He, A. Ussery, and J. Ansel, “Torch.fx: Practical program capture and transformation for deep learning in python,” 2022. [Online]. Available: https://arxiv.org/abs/2112.08429

work page arXiv 2022

[4] [4]

Grape: Practical and efficient graphed execution for dynamic deep neural networks on gpus,

B. Zheng, C. H. Yu, J. Wang, Y . Ding, Y . Liu, Y . Wang, and G. Pekhimenko, “Grape: Practical and efficient graphed execution for dynamic deep neural networks on gpus,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 1364–1380. [Onlin...

work page arXiv 2023

[5] [5]

TensorFlow Eager: A Multi-Stage, Python-Embedded DSL for Machine Learning

A. Agrawal, A. N. Modi, A. Passos, A. Lavoie, A. Agarwal, A. Shankar, I. Ganichev, J. Levenberg, M. Hong, R. Monga, and S. Cai, “Tensorflow eager: A multi-stage, python-embedded dsl for machine learning,” 2019. [Online]. Available: https://arxiv.org/abs/1903.01855

work page internal anchor Pith review Pith/arXiv arXiv 2019

[6] [6]

The jaseci programming paradigm and runtime stack: Build- ing scale-out production applications easy and fast,

J. Mars, Y . Kang, R. Daynauth, B. Li, A. Mahendra, K. Flautner, and L. Tang, “The jaseci programming paradigm and runtime stack: Build- ing scale-out production applications easy and fast,”IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 101–104, 2023

work page 2023

[7] [7]

Extending data spatial semantics for scale agnostic programming,

J. Mars, “Extending data spatial semantics for scale agnostic programming,” 2025. [Online]. Available: https://arxiv.org/abs/2504. 03109

work page 2025

[8] [8]

Jaseci: The official jaseci code repository,

Jaseci Labs, “Jaseci: The official jaseci code repository,” https://github. com/Jaseci-Labs/jaseci, 2025

work page 2025

[9] [9]

Mtp: A meaning-typed language abstraction for ai-integrated programming,

J. L. Dantanarayana, Y . Kang, K. Sivasothynathan, C. Clarke, B. Li, S. Kashmira, K. Flautner, L. Tang, and J. Mars, “Mtp: A meaning-typed language abstraction for ai-integrated programming,” 2025. [Online]. Available: https://arxiv.org/abs/2405.08965

work page arXiv 2025

[10] [10]

Hugging face: Open-source ai community and tools,

Hugging Face, “Hugging face: Open-source ai community and tools,” https://huggingface.co, 2025, accessed: 2025-09-12

work page 2025

[11] [11]

torch.jit.trace — pytorch documentation,

PyTorch Team, “torch.jit.trace — pytorch documentation,” https://docs. pytorch.org/docs/stable/generated/torch.jit.trace.html, 2025, accessed: 2025-09-08

work page 2025

[12] [12]

torch.jit.script — pytorch documentation,

PyTorch-TorchScript Team, “torch.jit.script — pytorch documentation,” https://docs.pytorch.org/docs/stable/generated/torch.jit.script.html, 2025, accessed: 2025-09-08

work page 2025

[13] [13]

Torch compile troubleshooting — pytorch documentation,

PyTorch Contributors, “Torch compile troubleshooting — pytorch documentation,” https://docs.pytorch.org/docs/stable/torch.compiler troubleshooting.html, 2025, accessed: 2025-09-08

work page 2025

[14] [14]

Pygraph: Robust compiler support for cuda graphs in pytorch,

A. Ghosh, A. Nayak, A. Panwar, and A. Basu, “Pygraph: Robust compiler support for cuda graphs in pytorch,” 2025. [Online]. Available: https://arxiv.org/abs/2503.19779

work page arXiv 2025

[15] [15]

BioGPT: generative pre-trained transformer for biomedical text generation and mining,

R. Luo, L. Sun, Y . Xia, T. Qin, S. Zhang, H. Poon, and T.-Y . Liu, “BioGPT: generative pre-trained transformer for biomedical text generation and mining,”Briefings in Bioinformatics, vol. 23, no. 6, 09 2022, bbac409. [Online]. Available: https://doi.org/10.1093/bib/bbac409

work page doi:10.1093/bib/bbac409 2022

[16] [16]

facebook/blenderbot-400m-distill,

Meta AI, “facebook/blenderbot-400m-distill,” https://huggingface.co/ facebook/blenderbot-400M-distill, 2020

work page 2020

[17] [17]

Scaling Instruction-Finetuned Language Models

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V . Zhao, Y . Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V . Le, and J. Wei, “Scaling instruction-finetuned language mo...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long- document transformer,”arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[19] babybirdprd, “babybirdprd/moe-minicpm-x4-base,” https://huggingface.co/babybirdprd/moe-minicpm-x4-base, 2025

[20] Microsoft, “Microsoft phi-4-mini-instruct,” https://huggingface.co/microsoft/Phi-4-mini-instruct, 2025

[21] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

[22] Hugging Face, “hf-internal-testing/tiny-random-pegasusforcausallm,” https://huggingface.co/hf-internal-testing/tiny-random-PegasusForCausalLM, 2025, accessed: 2025-09-12; internal testing minimal model

[23] “PyTorch profiler: A performance debugging and analysis tool for PyTorch,” https://pytorch.org/docs/stable/profiler.html, 2021, accessed: 2025-09-12

[24] OpenXLA Community, “StableHLO and OpenXLA,” 2023. [Online]. Available: https://openxla.org

[25] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An automated end-to-end optimizing compiler for deep learning,” in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018, pp. 578–594

[26] C. Lattner et al., “MLIR: A compiler infrastructure for the end of Moore’s law,” arXiv preprint arXiv:2002.11054, 2020

[27] N. Rotem et al., “Glow: Graph lowering compiler techniques for neural networks,” arXiv preprint arXiv:1805.00907, 2018

[28] Z. Jia, S. Lin, C. R. Qi, and A. Aiken, “TASO: Optimizing deep learning computation with automatic generation of graph substitutions,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), 2019, pp. 47–62

[29] Microsoft, “ONNX Runtime: High performance inference engine,” 2018. [Online]. Available: https://onnxruntime.ai

[30] Z. Zhao et al., “DLVM: A modern compiler IR for deep learning frameworks,” arXiv preprint arXiv:1711.03016, 2017

[31] P. Tillet et al., “Triton: An intermediate language and compiler for tiled neural network computations,” in ICML Workshop on Systems for ML, 2019

[32] J. Song et al., “Hidet: Task-mapping programming paradigm for deep learning tensor programs,” in Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024

[33] J. Ragan-Kelley et al., “Halide: A language and compiler for optimizing parallelism, locality, and recomputation,” in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2013, pp. 519–530

[34] F. Wang et al., “Lantern: A search-based compiler for deep learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 6035–6045.