Recognition: no theorem link
Source-to-Source Transformations for GPU Code Generation
Pith reviewed 2026-05-15 06:29 UTC · model grok-4.3
The pith
OptiGPU turns verified CPU programs into optimized GPU code through proof-preserving source-to-source transformations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OptiGPU applies proof-preserving compilation to GPU programming, allowing verification of low-level, optimized GPU programs through refinement of simple, verified CPU programs. Users direct this refinement with source-to-source transformations that automatically preserve proofs. The system extends an existing CPU framework to model kernel launches, shared memory, and synchronous barriers, producing CUDA code for both device and host sides.
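The refinement style this claim describes can be illustrated with a small hypothetical example (plain C++, not OptiGPU's actual transformation API, whose names are not given in the abstract): a flat verified loop is split into a two-level nest whose outer and inner levels correspond to CUDA blocks and threads, while computing the same function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original (verified) CPU form: one flat loop over n elements.
void scale_flat(std::vector<float>& a, float c) {
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] *= c;
}

// After a hypothetical loop-split transformation: the same iteration
// space, reshaped into (block, thread) pairs as a GPU mapping would
// require. Illustrative only; OptiGPU's own rules are not reproduced.
void scale_split(std::vector<float>& a, float c, std::size_t block_size) {
    std::size_t n = a.size();
    std::size_t num_blocks = (n + block_size - 1) / block_size;
    for (std::size_t b = 0; b < num_blocks; ++b)        // maps to blockIdx.x
        for (std::size_t t = 0; t < block_size; ++t) {  // maps to threadIdx.x
            std::size_t i = b * block_size + t;
            if (i < n) a[i] *= c;                       // guard for the ragged tail
        }
}
```

With any `block_size`, the split loop touches exactly the same indices as the flat loop, including the guarded tail when `n` is not a multiple of the block size; this index-preservation is the shape of fact a proof-preserving transformation must establish once, for all programs it rewrites.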
What carries the argument
Proof-preserving source-to-source transformations that refine CPU programs into GPU programs while maintaining verification properties.
Load-bearing premise
The source-to-source transformations correctly model GPU features like kernel launches, shared memory, and barriers while preserving the original proofs of safety and correctness.
What would settle it
A concrete case in which a transformed program exhibits a data race or deadlock absent from the original CPU program, or produces CUDA output that differs functionally from the verified CPU version.
Original abstract
GPUs have become essential in modern high performance computing, but programming them correctly remains a significant challenge. This difficulty arises from subtle concurrency bugs that result from the explicit management of synchronization primitives and data movement across intricate hierarchies of memory and parallel threads. At the same time, the ability to control these aspects explicitly is at the core of the performance gains granted by GPUs. These challenges have motivated interest in safe GPU programming: tools and languages that can prevent or detect such bugs statically. However, existing approaches make tradeoffs in three dimensions: the strength of their guarantees, the degree of low-level control they allow, and the amount of additional effort required to achieve these safety guarantees. This thesis presents OptiGPU, a system for GPU programming with strong safety guarantees (data race freedom, deadlock freedom, and full functional correctness) that minimizes tradeoffs in the other two dimensions compared to previous approaches. OptiGPU applies proof-preserving compilation to GPU programming, allowing verification of low-level, optimized GPU programs through refinement of simple, verified CPU programs. An OptiGPU user thus avoids the substantial proof burden of directly verifying complex optimized GPU programs, instead directing this refinement with source-to-source transformations that automatically preserve proofs. OptiGPU is implemented as a set of extensions to OptiTrust, an existing framework for proof-preserving compilation on CPUs. OptiGPU models essential GPU programming features, including kernel launches, shared memory, and synchronous barriers, and produces both device and host-side code. We evaluate OptiGPU on two case studies, matrix transpose and tree-based parallel reduction, showing it can derive CUDA code matching techniques found in handwritten references.
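For orientation, the matrix transpose case study targets the standard shared-memory tiling idiom of handwritten CUDA transposes. A CPU-side sketch of that schedule (illustrative only; OptiGPU's generated CUDA is not reproduced here, and the tile size is an arbitrary choice that assumes the matrix dimension is a multiple of it):

```cpp
#include <cassert>
#include <vector>

// Naive (verified) transpose of an n x n row-major matrix.
void transpose_naive(const std::vector<int>& in, std::vector<int>& out, int n) {
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            out[c * n + r] = in[r * n + c];
}

// Tiled schedule: iterate over TILE x TILE blocks, staging each through a
// local buffer - the CPU-side shape of the shared-memory CUDA transpose.
// Assumes n is a multiple of TILE.
void transpose_tiled(const std::vector<int>& in, std::vector<int>& out, int n) {
    const int TILE = 4;  // illustrative; real kernels commonly use 32
    int tile[TILE][TILE];
    for (int br = 0; br < n; br += TILE)
        for (int bc = 0; bc < n; bc += TILE) {
            for (int r = 0; r < TILE; ++r)          // load phase (row-contiguous
                for (int c = 0; c < TILE; ++c)      // reads, coalesced on a GPU)
                    tile[r][c] = in[(br + r) * n + (bc + c)];
            // On the GPU, a __syncthreads() barrier separates the two phases.
            for (int r = 0; r < TILE; ++r)          // store phase: write the
                for (int c = 0; c < TILE; ++c)      // transposed tile
                    out[(bc + r) * n + (br + c)] = tile[c][r];
        }
}
```

The point of the idiom is that both the load and store phases walk memory row-contiguously; only the small staging buffer is accessed in transposed order, which on a GPU is cheap shared memory.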
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OptiGPU, an extension to the OptiTrust framework that uses source-to-source transformations to refine verified CPU programs into optimized CUDA GPU code while automatically preserving proofs of data-race freedom, deadlock freedom, and functional correctness. It models GPU-specific features including kernel launches, shared memory, and synchronous barriers, and evaluates the approach on two case studies (matrix transpose and tree-based parallel reduction) where the generated code matches handwritten reference implementations.
Significance. If the transformations correctly preserve the stated properties, OptiGPU would meaningfully reduce the proof burden for safe, high-performance GPU programming by allowing verification to begin from simpler CPU code. The integration with an existing CPU framework and the concrete case studies demonstrating derivation of standard GPU idioms are strengths; however, the absence of any derivations, proof sketches, or quantitative verification metrics in the abstract limits the assessed impact.
major comments (2)
- Abstract: the central claim that source-to-source transformations 'automatically preserve proofs' of data race freedom, deadlock freedom, and functional correctness for GPU features (kernel launches, shared memory, barriers) is stated without reference to any theorem, lemma, or modeling argument; the case-study descriptions only report syntactic matching to handwritten code and provide no evidence that the proofs survive the transformations.
- Evaluation section (case studies): the matrix transpose and tree-reduction examples are presented as successful derivations, yet no quantitative metrics (e.g., performance numbers, proof-checking times, or verification coverage) or error analysis are supplied, leaving the claim that the generated code is both correct and optimized unsupported beyond visual similarity.
minor comments (1)
- Abstract: the phrase 'produces both device and host-side code' is introduced without clarifying how host-device data movement and synchronization are modeled in the refinement relation.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive suggestions. We address each major comment below, indicating planned revisions to the manuscript.
Point-by-point responses
Referee: Abstract: the central claim that source-to-source transformations 'automatically preserve proofs' of data race freedom, deadlock freedom, and functional correctness for GPU features (kernel launches, shared memory, barriers) is stated without reference to any theorem, lemma, or modeling argument; the case-study descriptions only report syntactic matching to handwritten code and provide no evidence that the proofs survive the transformations.
Authors: We agree that the abstract would benefit from explicit references to the supporting theorems. In the manuscript, the proof preservation is formalized in Section 3 (Modeling GPU Features) and Section 4 (Transformation Rules), with Theorem 4.2 establishing that the source-to-source transformations preserve data-race freedom, deadlock freedom, and functional correctness for the modeled GPU constructs. The case studies apply these transformations, and the syntactic matching is a consequence of the successful application of the rules. We will revise the abstract to cite these results and add a brief explanation in the evaluation section on how the general theorems apply to the examples. revision: yes
Referee: Evaluation section (case studies): the matrix transpose and tree-reduction examples are presented as successful derivations, yet no quantitative metrics (e.g., performance numbers, proof-checking times, or verification coverage) or error analysis are supplied, leaving the claim that the generated code is both correct and optimized unsupported beyond visual similarity.
Authors: The evaluation in the current version prioritizes demonstrating the feasibility of deriving optimized GPU code from verified CPU programs using the transformations. The matching to handwritten references indicates both correctness (via proof preservation) and optimization (as these are standard high-performance idioms). However, we acknowledge the value of quantitative metrics. In the revised manuscript, we will add performance measurements (e.g., execution times on a target GPU) for the generated code versus the references, along with verification times using the underlying proof assistant. Since the approach is proof-preserving by design, there is no error analysis per se, but we will include a discussion of any edge cases encountered in the case studies. revision: partial
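The tree-based parallel reduction case study discussed above follows the classic stride-halving pattern. A sequential CPU analogue (illustrative, not OptiGPU output; on the GPU, each round boundary corresponds to a __syncthreads() barrier within a block):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential (verified) reduction.
int sum_seq(const std::vector<int>& a) {
    int s = 0;
    for (int v : a) s += v;
    return s;
}

// Tree-shaped reduction over a power-of-two-sized array: each round halves
// the number of active positions, mirroring the stride loop of a
// barrier-synchronized CUDA block reduction. Within one round all the
// additions are independent, which is what lets threads run them in
// parallel between barriers.
int sum_tree(std::vector<int> a) {  // taken by value: rounds mutate in place
    for (std::size_t stride = a.size() / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)  // all "threads" of one round
            a[i] += a[i + stride];
    return a[0];
}
```

A data-race-freedom argument here rests on the disjointness of reads (`a[i + stride]`) and writes (`a[i]`) within a round; the barrier between rounds is what makes the cross-round dependence safe in the CUDA version.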
Circularity Check
No significant circularity
full rationale
The paper presents OptiGPU as an extension of the existing OptiTrust framework for proof-preserving source-to-source transformations targeting GPU features such as kernel launches, shared memory, and barriers. The central claims about automatic preservation of data-race freedom, deadlock freedom, and functional correctness are grounded in the modeling of these features and the inheritance of refinement properties from the base system, without any equations, fitted parameters, or self-definitional reductions that collapse outputs to inputs by construction. The case studies on matrix transpose and tree reduction serve as empirical validation rather than load-bearing derivations. No self-citation is used to justify uniqueness theorems or ansatzes in a circular manner, and the generated code is checked against external handwritten references rather than against the system's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Source-to-source transformations can be defined to preserve proofs of correctness when extending from CPU to GPU features such as kernels and barriers.
Reference graph
Works this paper leans on
- [1] Izumi Asakura, Hidehiko Masuhara, and Tomoyuki Aotani. 2016. Proof of Soundness of Concurrent Separation Logic for GPGPU in Coq. Journal of Information Processing 24, 1 (2016), 132–140. https://doi.org/10.2197/ipsjjip.24.132
- [2] Fabian Bannwart and Peter Müller. 2005. A Program Logic for Bytecode. Electronic Notes in Theoretical Computer Science 141, 1 (2005), 255–273. https://doi.org/10.1016/j.entcs.2005.02.026
- [3] Manya Bansal, Daniel Sainati, Joseph W. Cutler, Saman Amarasinghe, and Jonathan Ragan-Kelley. 2025. Modular GPU Programming with Typed Perspectives. Retrieved from https://arxiv.org/abs/2511.11939
- [4] Guillaume Bertholon. 2025. Interactive compilation via trustworthy source-to-source transformations. PhD thesis. Retrieved from https://theses.hal.science/tel-05302456
- [5] Adam Betts, Nathan Chong, Alastair Donaldson, Shaz Qadeer, and Paul Thomson. 2012. GPUVerify: a verifier for GPU kernels. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '12), 2012. Association for Computing Machinery, Tucson, Arizona, USA, 113–132. https://doi.org/10.1145/2384616.2384625
- [7] Stefan Blom, Saeed Darabi, Marieke Huisman, and Wytse Oortwijn. 2017. The VerCors Tool Set: Verification of Parallel and Concurrent Software. In Integrated Formal Methods, 2017. Springer International Publishing, Cham, 102–110
- [8] Stephen Brookes and Peter W. O'Hearn. 2016. Concurrent Separation Logic. ACM SIGLOG News 3, 3 (August 2016), 47–65. https://doi.org/10.1145/2984450.2984457
- [9] Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. 2011. Accelerating Haskell array codes with multicore GPUs. In Proceedings of the Sixth Workshop on Declarative Aspects of Multicore Programming, 2011. 3–14
- [10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI '18), 2018. ...
- [11] Bastian Hagedorn, Johannes Lenfers, Thomas Kœhler, Xueying Qin, Sergei Gorlatch, and Michel Steuwer. 2020. Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies. Proc. ACM Program. Lang. 4, ICFP (August 2020). https://doi.org/10.1145/3408974
- [12] Mark Harris. 2012. How to Implement Performance Metrics in CUDA C/C++. Retrieved from https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
- [13] Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea. 2017. Futhark: purely functional GPU-programming with nested parallelism and in-place array updates. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2017. 556–571
- [14] Bastian Köpcke, Sergei Gorlatch, and Michel Steuwer. 2024. Descend: A Safe GPU Systems Programming Language. Proc. ACM Program. Lang. 8, PLDI (June 2024). https://doi.org/10.1145/3656411
- [15] Dennis Liew, Tiago Cogumbreiro, and Julien Lange. 2024. Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs. Proc. ACM Program. Lang. 8, OOPSLA2 (October 2024). https://doi.org/10.1145/3689797
- [16] Guido Martínez, Jonas Fiala, Abhinav Jangda, Angelica Moreira, Nikhil Swamy, and Tyler Sorensen. 2025. Kuiper: verified and efficient GPU programming. Retrieved from https://www.youtube.com/watch?v=imWt5cnZxb0
- [17] George Ciprian Necula. 1998. Compiling with proofs. Doctoral dissertation.
- [18] Nvidia. 2026. cuda-samples. Retrieved from https://github.com/NVIDIA/cuda-samples/tree/4f735616ba599fe93cc2c6c85dcb4369260f9643
- [19] Peter W. O'Hearn, John C. Reynolds, and Hongseok Yang. 2001. Local Reasoning about Programs that Alter Data Structures. In CSL: 15th Workshop on Computer Science Logic, 2001. LNCS, Springer-Verlag. https://doi.org/10.1007/3-540-44802-0_1
- [21] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Conference on Programming Language Design and Implementation, 2013. https://doi.org/10.1145/2491956.2462176
- [22] Ari Rasch. 2024. (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM Trans. Program. Lang. Syst. 46, 3 (October 2024). https://doi.org/10.1145/3665643
- [23] Rust Team. 2026. Rust Programming Language. Retrieved from https://rust-lang.org/
- [24] Ömer Şakar, Mohsen Safari, Marieke Huisman, and Anton Wijs. 2022. Alpinist: An Annotation-Aware GPU Program Optimizer. In Tools and Algorithms for the Construction and Analysis of Systems - 28th International Conference, TACAS 2022, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022, Munich, Germany, April 2-7, 2022. Springer, 332–352. https://doi.org/10.1007/978-3-030-99527-0_18
- [26] Ömer Şakar, Mohsen Safari, Marieke Huisman, and Anton Wijs. 2025. Preserving provability over GPU program optimizations with annotation-aware transformations. Formal Methods in System Design (2025), 1–57. https://doi.org/10.1007/s10703-025-00480-7