pith. machine review for the scientific record.

arxiv: 2605.13864 · v1 · submitted 2026-04-30 · 💻 cs.PL

Recognition: no theorem link

Source-to-Source Transformations for GPU Code Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:29 UTC · model grok-4.3

classification 💻 cs.PL
keywords GPU programming · proof-preserving compilation · source-to-source transformations · CUDA code generation · data race freedom · deadlock freedom · functional correctness

The pith

OptiGPU turns verified CPU programs into optimized GPU code through proof-preserving source-to-source transformations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

OptiGPU lets users begin with a simple, verified CPU program and apply targeted source-to-source transformations to produce efficient CUDA code for GPUs. Each transformation is constructed to automatically preserve the original proofs of data race freedom, deadlock freedom, and functional correctness. This approach avoids the heavy burden of directly proving properties of complex low-level GPU implementations. The system models GPU-specific features such as kernel launches, shared memory, and synchronous barriers, generating both device and host code. Case studies on matrix transpose and tree-based parallel reduction demonstrate that the derived CUDA code matches techniques found in handwritten references.

Core claim

OptiGPU applies proof-preserving compilation to GPU programming, allowing verification of low-level, optimized GPU programs through refinement of simple, verified CPU programs. Users direct this refinement with source-to-source transformations that automatically preserve proofs. The system extends an existing CPU framework to model kernel launches, shared memory, and synchronous barriers, producing CUDA code for both device and host sides.
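The refinement step described here can be pictured with a toy model. The Python below is purely illustrative (our names and structure, not the OptiTrust/OptiGPU API): a verified sequential loop is rewritten into a guarded grid-and-block form that mirrors a CUDA launch, and the final check stands in for the preserved functional-correctness proof.

```python
# Illustrative model (not the OptiGPU API): a "split loop into a grid"
# transformation applied to vector addition. The CPU version plays the
# role of the simple, verified source program; the transformed version
# models one CUDA thread per index, with the usual bounds guard.

def cpu_vector_add(a, b):
    # Simple, sequential reference: the "verified CPU program".
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def gpu_style_vector_add(a, b, block_dim=4):
    # Transformed program: the loop index is re-expressed as
    # blockIdx * blockDim + threadIdx, as in generated CUDA.
    n = len(a)
    out = [0] * n
    grid_dim = (n + block_dim - 1) // block_dim   # ceil(n / block_dim)
    for block_idx in range(grid_dim):             # models the grid
        for thread_idx in range(block_dim):       # models threads in a block
            i = block_idx * block_dim + thread_idx
            if i < n:                             # guard against overshoot
                out[i] = a[i] + b[i]
    return out

a, b = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
# The transformation preserves the result for any block_dim.
assert gpu_style_vector_add(a, b) == cpu_vector_add(a, b)
```

In the actual system, such a rewrite carries a proof obligation discharged once per transformation rule rather than a runtime check.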

What carries the argument

Proof-preserving source-to-source transformations that refine CPU programs into GPU programs while maintaining verification properties.

Load-bearing premise

The source-to-source transformations correctly model GPU features like kernel launches, shared memory, and barriers while preserving the original proofs of safety and correctness.
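As intuition for this premise, the tiled matrix transpose from the case studies can be modeled in a few lines of Python. This sketch is ours, not OptiGPU output: the comment marks where a `__syncthreads()` barrier would sit, separating the tile-load phase from the transposed-store phase so that no store can race with a pending load.

```python
# Illustrative model of the matrix-transpose case study. A shared-memory
# tile is filled by all "threads" (ty, tx) in phase 1; a barrier separates
# the phases; phase 2 writes the tile back transposed.

TILE = 2

def transpose_tiled(mat):
    n = len(mat)                          # assumed a multiple of TILE
    out = [[0] * n for _ in range(n)]
    for by in range(0, n, TILE):
        for bx in range(0, n, TILE):
            tile = [[0] * TILE for _ in range(TILE)]
            # Phase 1: each thread loads one element into shared memory.
            for ty in range(TILE):
                for tx in range(TILE):
                    tile[ty][tx] = mat[by + ty][bx + tx]
            # __syncthreads(): all loads finish before any store begins.
            # Phase 2: each thread stores the transposed element.
            for ty in range(TILE):
                for tx in range(TILE):
                    out[bx + tx][by + ty] = tile[ty][tx]
    return out

m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ref = [[m[j][i] for j in range(4)] for i in range(4)]
assert transpose_tiled(m) == ref
```

The premise is that the transformation inserting the tile and the barrier can be proved, once and for all, never to introduce a race across the two phases.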

What would settle it

A concrete case in which a transformed program exhibits a data race or deadlock absent from the original CPU program, or produces CUDA output that differs functionally from the verified CPU version.

Original abstract

GPUs have become essential in modern high performance computing, but programming them correctly remains a significant challenge. This difficulty arises from subtle concurrency bugs that result from the explicit management of synchronization primitives and data movement across intricate hierarchies of memory and parallel threads. At the same time, the ability to control these aspects explicitly is at the core of the performance gains granted by GPUs. These challenges have motivated interest in safe GPU programming: tools and languages that can prevent or detect such bugs statically. However, existing approaches make tradeoffs in three dimensions: the strength of their guarantees, the degree of low-level control they allow, and the amount of additional effort required to achieve these safety guarantees. This thesis presents OptiGPU, a system for GPU programming with strong safety guarantees (data race freedom, deadlock freedom, and full functional correctness) that minimizes tradeoffs in the other two dimensions compared to previous approaches. OptiGPU applies proof-preserving compilation to GPU programming, allowing verification of low-level, optimized GPU programs through refinement of simple, verified CPU programs. An OptiGPU user thus avoids the substantial proof burden of directly verifying complex optimized GPU programs, instead directing this refinement with source-to-source transformations that automatically preserve proofs. OptiGPU is implemented as a set of extensions to OptiTrust, an existing framework for proof-preserving compilation on CPUs. OptiGPU models essential GPU programming features, including kernel launches, shared memory, and synchronous barriers, and produces both device and host-side code. We evaluate OptiGPU on two case studies, matrix transpose and tree-based parallel reduction, showing it can derive CUDA code matching techniques found in handwritten references.
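The second case study, tree-based parallel reduction, can be sketched the same way. The Python model below is ours (the paper derives CUDA): at each level, thread i combines slot i with slot i + stride, and a barrier between levels prevents any thread from reading a partially updated slot.

```python
# Illustrative model of the tree-based reduction case study over an
# in-block shared array. Each while-iteration is one level of the tree;
# the stride halves until a single partial sum remains in slot 0.

def tree_reduce(xs):
    data = list(xs)                 # models the shared-memory array
    n = len(data)                   # assumed a power of two here
    stride = n // 2
    while stride > 0:
        for i in range(stride):     # models threads 0..stride-1 in parallel
            data[i] += data[i + stride]
        # __syncthreads(): the level completes before the stride halves.
        stride //= 2
    return data[0]

assert tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]) == 36   # = sum(1..8)
```

Within one level the updated slots (i) and the read slots (i + stride) are disjoint, which is the disjointness argument a race-freedom proof for this pattern rests on.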

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OptiGPU, an extension to the OptiTrust framework that uses source-to-source transformations to refine verified CPU programs into optimized CUDA GPU code while automatically preserving proofs of data-race freedom, deadlock freedom, and functional correctness. It models GPU-specific features including kernel launches, shared memory, and synchronous barriers, and evaluates the approach on two case studies (matrix transpose and tree-based parallel reduction) where the generated code matches handwritten reference implementations.

Significance. If the transformations correctly preserve the stated properties, OptiGPU would meaningfully reduce the proof burden for safe, high-performance GPU programming by allowing verification to begin from simpler CPU code. The integration with an existing CPU framework and the concrete case studies demonstrating derivation of standard GPU idioms are positive aspects; however, the absence of any derivations, proof sketches, or quantitative verification metrics in the abstract limits the assessed impact.

major comments (2)
  1. Abstract: the central claim that source-to-source transformations 'automatically preserve proofs' of data race freedom, deadlock freedom, and functional correctness for GPU features (kernel launches, shared memory, barriers) is stated without reference to any theorem, lemma, or modeling argument; the case-study descriptions only report syntactic matching to handwritten code and provide no evidence that the proofs survive the transformations.
  2. Evaluation section (case studies): the matrix transpose and tree-reduction examples are presented as successful derivations, yet no quantitative metrics (e.g., performance numbers, proof-checking times, or verification coverage) or error analysis are supplied, leaving the claim that the generated code is both correct and optimized unsupported beyond visual similarity.
minor comments (1)
  1. Abstract: the phrase 'produces both device and host-side code' is introduced without clarifying how host-device data movement and synchronization are modeled in the refinement relation.
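For intuition about what this comment asks for, one hypothetical way to model host-device data movement in a refinement is to treat device memory as a separate store and refine the host program into copy-in, launch, and copy-out steps whose composition equals the original CPU function. Everything below is our illustration, not the paper's refinement relation:

```python
# Hedged sketch: device memory as a separate store. The refined host
# program only touches device data through explicit copies, mirroring
# cudaMemcpy in generated host code.

def cpu_scale(xs, k):
    # Verified CPU reference program.
    return [k * x for x in xs]

def host_program(xs, k):
    device_mem = {}
    device_mem["xs"] = list(xs)            # copy host -> device
    device_mem["out"] = [0] * len(xs)      # device allocation
    for i in range(len(xs)):               # kernel: one thread per index
        device_mem["out"][i] = k * device_mem["xs"][i]
    return list(device_mem["out"])         # copy device -> host

assert host_program([1, 2, 3], 5) == cpu_scale([1, 2, 3], 5)
```

A refinement relation over such a model would additionally have to track when each copy and synchronization point is legal, which is the detail the comment asks the authors to spell out.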

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive suggestions. We address each major comment below, indicating planned revisions to the manuscript.

Point-by-point responses
  1. Referee: Abstract: the central claim that source-to-source transformations 'automatically preserve proofs' of data race freedom, deadlock freedom, and functional correctness for GPU features (kernel launches, shared memory, barriers) is stated without reference to any theorem, lemma, or modeling argument; the case-study descriptions only report syntactic matching to handwritten code and provide no evidence that the proofs survive the transformations.

    Authors: We agree that the abstract would benefit from explicit references to the supporting theorems. In the manuscript, the proof preservation is formalized in Section 3 (Modeling GPU Features) and Section 4 (Transformation Rules), with Theorem 4.2 establishing that the source-to-source transformations preserve data-race freedom, deadlock freedom, and functional correctness for the modeled GPU constructs. The case studies apply these transformations, and the syntactic matching is a consequence of the successful application of the rules. We will revise the abstract to cite these results and add a brief explanation in the evaluation section on how the general theorems apply to the examples. revision: yes

  2. Referee: Evaluation section (case studies): the matrix transpose and tree-reduction examples are presented as successful derivations, yet no quantitative metrics (e.g., performance numbers, proof-checking times, or verification coverage) or error analysis are supplied, leaving the claim that the generated code is both correct and optimized unsupported beyond visual similarity.

    Authors: The evaluation in the current version prioritizes demonstrating the feasibility of deriving optimized GPU code from verified CPU programs using the transformations. The matching to handwritten references indicates both correctness (via proof preservation) and optimization (as these are standard high-performance idioms). However, we acknowledge the value of quantitative metrics. In the revised manuscript, we will add performance measurements (e.g., execution times on a target GPU) for the generated code versus the references, along with verification times using the underlying proof assistant. Since the approach is proof-preserving by design, there is no error analysis per se, but we will include a discussion of any edge cases encountered in the case studies. revision: partial

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents OptiGPU as an extension of the existing OptiTrust framework for proof-preserving source-to-source transformations targeting GPU features such as kernel launches, shared memory, and barriers. The central claims about automatic preservation of data-race freedom, deadlock freedom, and functional correctness are grounded in the modeling of these features and the inheritance of refinement properties from the base system, without any equations, fitted parameters, or self-definitional reductions that collapse outputs to inputs by construction. Case studies on matrix transpose and tree reduction serve as empirical validation rather than load-bearing derivations. No self-citation is used to justify uniqueness theorems or ansatzes in a circular manner; the approach remains self-contained against external benchmarks of program refinement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that transformations preserve proofs when modeling GPU primitives; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Source-to-source transformations can be defined to preserve proofs of correctness when extending from CPU to GPU features such as kernels and barriers.
    Invoked as the core mechanism allowing verification of optimized GPU code via refinement of CPU programs.

pith-pipeline@v0.9.0 · 5600 in / 1206 out tokens · 29479 ms · 2026-05-15T06:29:40.007543+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. [1]

    Izumi Asakura, Hidehiko Masuhara, and Tomoyuki Aotani. 2016. Proof of Soundness of Concurrent Separation Logic for GPGPU in Coq. Journal of Information Processing 24, 1 (2016), 132–140. https://doi.org/10.2197/ipsjjip.24.132

  2. [2]

    Fabian Bannwart and Peter Müller. 2005. A Program Logic for Bytecode. Electronic Notes in Theoretical Computer Science 141, 1 (2005), 255–273. https://doi.org/10.1016/j.entcs.2005.02.026

  3. [3]

    Manya Bansal, Daniel Sainati, Joseph W. Cutler, Saman Amarasinghe, and Jonathan Ragan-Kelley. 2025. Modular GPU Programming with Typed Perspectives. Retrieved from https://arxiv.org/abs/2511.11939

  4. [4]

    Guillaume Bertholon. 2025. Interactive compilation via trustworthy source-to-source transformations. Theses. Retrieved from https://theses.hal.science/tel-05302456

  5. [5]

    Adam Betts, Nathan Chong, Alastair Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: a verifier for GPU kernels. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '12), 2012. Association for Computing Machinery, Tucson, Arizona, USA, 113–132. https://doi.org/10.1145/2384616.2384625

  7. [7]

    Stefan Blom, Saeed Darabi, Marieke Huisman, and Wytse Oortwijn. 2017. The VerCors Tool Set: Verification of Parallel and Concurrent Software. In Integrated Formal Methods, 2017. Springer International Publishing, Cham, 102–110.

  8. [8]

    Stephen Brookes and Peter W. O’Hearn. 2016. Concurrent Separation Logic. ACM SIGLOG News 3, 3 (August 2016), 47–65. https://doi.org/10.1145/2984450.2984457

  9. [9]

    Manuel MT Chakravarty, Gabriele Keller, Sean Lee, Trevor L McDonell, and Vinod Grover. 2011. Accelerating Haskell array codes with multicore GPUs. In Proceedings of the sixth workshop on Declarative aspects of multicore programming, 2011. 3–14

  10. [10]

    Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18), 2018. ...

  11. [11]

    Bastian Hagedorn, Johannes Lenfers, Thomas Kœhler, Xueying Qin, Sergei Gorlatch, and Michel Steuwer. 2020. Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies. Proc. ACM Program. Lang. 4, ICFP (August 2020). https://doi.org/10.1145/3408974

  12. [12]

    Mark Harris. 2012. How to Implement Performance Metrics in CUDA C/C++. Retrieved from https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/

  13. [13]

    Troels Henriksen, Niels GW Serup, Martin Elsman, Fritz Henglein, and Cosmin E Oancea. 2017. Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. In Proceedings of the 38th ACM SIGPLAN Conf. on Programming Language Design and Implementation, 2017. 556–571

  14. [14]

    Bastian Köpcke, Sergei Gorlatch, and Michel Steuwer. 2024. Descend: A Safe GPU Systems Programming Language. Proc. ACM Program. Lang. 8, PLDI (June 2024). https://doi.org/10.1145/3656411

  15. [15]

    Dennis Liew, Tiago Cogumbreiro, and Julien Lange. 2024. Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs. Proc. ACM Program. Lang. 8, OOPSLA2 (October 2024). https://doi.org/10.1145/3689797

  16. [16]

    Guido Martínez, Jonas Fiala, Abhinav Jangda, Angelica Moreira, Nikhil Swamy, and Tyler Sorensen. 2025. Kuiper: verified and efficient GPU programming. Retrieved from https://www.youtube.com/watch?v=imWt5cnZxb0

  17. [17]

    George Ciprian Necula. 1998. Compiling with proofs. Doctoral dissertation.

  18. [18]

    Nvidia. 2026. cuda-samples. Retrieved from https://github.com/NVIDIA/cuda-samples/tree/4f735616ba599fe93cc2c6c85dcb4369260f9643

  19. [19]

    O'Hearn, Reynolds, and Yang. 2001. Local Reasoning about Programs that Alter Data Structures. In CSL: 15th Workshop on Computer Science Logic, 2001. LNCS, Springer-Verlag. Retrieved from https://doi.org/10.1007/3-540-44802-0_1

  20. [21]

    Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Conference on Programming Language Design and Implementation, 2013. https://doi.org/10.1145/2491956.2462176

  21. [22]

    Ari Rasch. 2024. (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM Trans. Program. Lang. Syst. 46, 3 (October 2024). https://doi.org/10.1145/3665643

  22. [23]

    Rust Team. 2026. Rust Programming Language. Retrieved from https://rust-lang.org/

  23. [24]

    Ömer Şakar, Mohsen Safari, Marieke Huisman, and Anton Wijs. 2022. Alpinist: An Annotation-Aware GPU Program Optimizer. In Tools and Algorithms for the Construction and Analysis of Systems - 28th International Conference, TACAS 2022, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022, Munich, Germany, April 2-7, 2022. Springer, 332–352. https://doi.org/10.1007/978-3-030-99527-0_18

  25. [26]

    Ömer Şakar, Mohsen Safari, Marieke Huisman, and Anton Wijs. 2025. Preserving provability over GPU program optimizations with annotation-aware transformations. Formal Methods in System Design (2025), 1–57. https://doi.org/10.1007/s10703-025-00480-7