
arXiv: 2505.16968 · v4 · submitted 2025-05-22 · 💻 cs.AR · cs.AI · cs.CL · cs.LG · cs.PL

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Pith reviewed 2026-05-22 01:24 UTC · model grok-4.3

keywords: GPU code translation · CUDA to HIP transpilation · cross-architecture GPU · machine learning for compilers · code dataset generation · performance matching · GPU benchmarking

The pith

A dataset of 60,000 aligned GPU code pairs enables training of translation models that convert CUDA to HIP at 88 percent accuracy while matching native performance in most cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents CASS as the first large-scale dataset and accompanying model suite for translating GPU code between Nvidia and AMD hardware and their associated programming models. The authors build this resource through an automated pipeline that scrapes, translates, compiles, and aligns programs to create matching pairs across the two vendor stacks. Models trained on these pairs reach 88.2 percent accuracy on CUDA to HIP and 69.1 percent on SASS to RDNA3, and the generated code matches native performance in 85 percent of cases. If these results hold, such automated translation would let developers write code once and deploy it across multiple GPU platforms without extensive manual porting. A dedicated benchmark, CASS-Bench, supports systematic testing of these translation capabilities.

Core claim

CASS contains 60k verified host-device code pairs generated via an automated scraping-translating-compiling-aligning pipeline, enabling learning-based translation across both ISA and runtime boundaries for CUDA to HIP and SASS to RDNA3. Domain-specific models achieve 88.2 percent accuracy on CUDA to HIP translation and 69.1 percent on SASS to RDNA3, with generated code matching native performance in 85 percent of cases.
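The translate-compile-align stages named above can be pictured as a filter in which a sample survives only if every stage succeeds. The following is a minimal sketch under that assumption, not the authors' pipeline: `AlignedPair`, `build_pairs`, and the three stage callables are hypothetical names, and the real tools (for example HIPIFY and the vendor compilers) would have to be wrapped by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class AlignedPair:
    """One aligned sample: source and assembly for both vendor stacks (hypothetical layout)."""
    cuda_src: str
    hip_src: str
    nvidia_asm: str
    amd_asm: str


# Each stage returns None on failure so the sample can be dropped.
Stage = Callable[[str], Optional[str]]


def build_pairs(
    cuda_sources: Iterable[str],
    translate: Stage,
    compile_nvidia: Stage,
    compile_amd: Stage,
) -> List[AlignedPair]:
    """Keep only samples that survive translation and both compilations."""
    pairs: List[AlignedPair] = []
    for cuda_src in cuda_sources:
        hip_src = translate(cuda_src)
        if hip_src is None:
            continue  # translation failed, drop the sample
        nvidia_asm = compile_nvidia(cuda_src)
        amd_asm = compile_amd(hip_src)
        if nvidia_asm is None or amd_asm is None:
            continue  # one side did not compile, so no aligned pair exists
        pairs.append(AlignedPair(cuda_src, hip_src, nvidia_asm, amd_asm))
    return pairs
```

Passing the stages in as callables keeps the sketch independent of any particular toolchain. Note that surviving compilation on both sides does not by itself establish functional equivalence, which is the concern raised under the load-bearing premise.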

What carries the argument

The CASS dataset of 60k verified code pairs produced by the automated pipeline for training domain-specific translation models.

If this is right

  • Automated translation reduces the effort needed to port GPU applications between different vendor hardware.
  • High accuracy at both source and assembly levels supports broader use in compiler tools and binary compatibility.
  • The CASS-Bench benchmark allows consistent evaluation of translation quality with ground-truth execution results.
  • Open release of data, models, and tools accelerates development in GPU portability.
  • Performance preservation in 85 percent of cases indicates the translations can be used in production environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the alignment process scales reliably, similar methods could apply to translations involving other processor architectures.
  • Success here suggests potential for reducing dependence on single-vendor GPU ecosystems through automated tooling.
  • Further improvements might come from combining these models with traditional compiler optimization passes.

Load-bearing premise

The automated pipeline produces 60k code pairs that are functionally correct, representative of real workloads, and free from systematic verification errors or biases.

What would settle it

Applying the trained translation models to a new collection of CUDA programs not used in training and verifying both the correctness of the output and its runtime performance compared to hand-written or native AMD versions.
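The two quantities such a held-out test would report can be sketched as follows. This is an illustration only: the `Result` record, the `summarize` function, and the 10 percent slowdown threshold are assumptions, since the text above does not say how the paper defines a performance match.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    """Outcome for one held-out CUDA program (hypothetical record layout)."""
    correct: bool       # translated program produced the expected output
    predicted_s: float  # runtime of the model-translated HIP program, in seconds
    native_s: float     # runtime of the hand-written or native AMD version, in seconds


def summarize(results: List[Result], rel_tol: float = 0.10) -> Dict[str, float]:
    """Return accuracy and the share of correct programs whose runtime is
    at most rel_tol slower than native (an assumed threshold)."""
    if not results:
        raise ValueError("need at least one result")
    correct = [r for r in results if r.correct]
    matched = [r for r in correct if r.predicted_s <= r.native_s * (1.0 + rel_tol)]
    return {
        "accuracy": len(correct) / len(results),
        "perf_match": len(matched) / len(correct) if correct else 0.0,
    }
```

Performance is computed only over correct translations here, because the runtime of a program that produces wrong output says little about translation quality.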

Figures

Figures reproduced from arXiv: 2505.16968 by Abdulrahman Mahmoud, Ahmed Heakl, Arina Kharlamova, Gustavo Bertolo Stahl, Mukul Ranjan, Salman Khan, Sarim Hashmi, Seung Hun Eddie Han.

Figure 1: CASS Data Collection Pipeline. We collect CUDA code from public repositories and synthesize additional samples via prompt-based LLM generation. After filtering and deduplication, all CUDA files are translated to HIP using HIPIFY, then compiled to extract host and device assembly. Matched outputs form the CASS dataset with aligned source and assembly pairs across Nvidia and AMD stacks.
Figure 2: The Nvidia (left) and AMD (right) stacks illus…
Figure 3: CASS coverage across dataset and benchmark. (left) category distribution in CASS-Bench (center) domain distribution of training samples (right) verbosity across subsets and backends. …repositories used for training data collection (§3), applying strict deduplication to ensure zero overlap between benchmark and training sets. Each sample was selected to compile and execute successfully on both Nvidia and AMD har…
Figure 4: Distribution of compilation errors. [Adjacent plot: density of runtime (seconds), Predicted ASM vs. Ground Truth.]
Figure 7: Effect of input length on assembly translation performance. All metrics degrade as sequence length increases.
Figure 8: Assembly-level failures across categories.
Figure 9: t-SNE projection of CUDA and HIP assembly…
Figure 10: Comparison of memory usage (left) and execution time (right) between predicted and ground truth HIP…
Figure 11: Relationship between source and assembly…
Figure 13: Accuracy vs. training steps for source/assem…
Figure 14: Model-wise distribution of compilation errors across different models. CASS models trained on A100 data exhibit significantly fewer invalid instruction and operand errors compared to all baselines, including frontier models (GPT-5.1, Gemini-2.5-Pro), coding-specialized models (Qwen2.5-Coder-32B, DeepSeek-Coder-V2), and general instruction-tuned models (Llama-3.1, Gemma-3). Scaling from 1.5B to 7B paramete…
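The text accompanying Figure 3 mentions strict deduplication to ensure zero overlap between benchmark and training sets. One simple way to check for exact overlap is to compare normalized fingerprints, as in the sketch below. The normalization (dropping C-style comments and all whitespace) and the names `fingerprint` and `overlap` are assumptions for illustration, not the procedure the paper describes.

```python
import hashlib
import re
from typing import Iterable, Set


def fingerprint(source: str) -> str:
    """Hash a program after stripping comments and all whitespace.

    Simplification: comment markers inside string literals are also
    stripped, which is acceptable for a coarse duplicate check.
    """
    no_block = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    no_line = re.sub(r"//[^\n]*", "", no_block)
    collapsed = re.sub(r"\s+", "", no_line)
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def overlap(train: Iterable[str], bench: Iterable[str]) -> Set[str]:
    """Return fingerprints present in both sets; empty means no exact overlap."""
    return {fingerprint(s) for s in train} & {fingerprint(s) for s in bench}
```

A check like this only catches copies that differ in formatting or comments. Near-duplicates with renamed identifiers would need a fuzzier comparison.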
Original abstract

Cross-architecture GPU code transpilation is essential for unlocking low-level hardware portability, yet no scalable solution exists. We introduce CASS, the first dataset and model suite for source- and assembly-level GPU translation (CUDA <--> HIP, SASS <--> RDNA3). CASS contains 60k verified host-device code pairs, enabling learning-based translation across both ISA and runtime boundaries. We generate each sample using our automated pipeline that scrapes, translates, compiles, and aligns GPU programs across vendor stacks. Leveraging CASS, we train a suite of domain-specific translation models that achieve 88.2% accuracy on CUDA -> HIP and 69.1% on SASS -> RDNA3, outperforming commercial baselines including GPT-5.1, Claude-4.5, and Hipify by wide margins. Generated code matches native performance in 85% of cases, preserving both runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 18 GPU domains with ground-truth execution. All data, models, and evaluation tools will be released as open source to support progress in GPU compiler tooling, binary compatibility, and LLM-guided code translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents CASS, the first dataset and model suite for source- and assembly-level GPU translation between NVIDIA and AMD architectures (CUDA <-> HIP, SASS <-> RDNA3). It contains 60k verified host-device code pairs generated via an automated scraping-translating-compiling-aligning pipeline. Domain-specific models trained on CASS achieve 88.2% accuracy on CUDA to HIP and 69.1% on SASS to RDNA3, outperforming commercial baselines including GPT-5.1, Claude-4.5, and Hipify. Generated code matches native performance in 85% of cases while preserving runtime and memory behavior. The work also introduces CASS-Bench, a curated benchmark spanning 18 GPU domains with ground-truth execution, and commits to open-sourcing all data, models, and tools.

Significance. If the dataset verification and empirical results hold, this would be a significant contribution to GPU compiler tooling and cross-architecture portability. The creation of a large-scale verified dataset for learnable transpilation addresses a clear gap, and the open release of data, models, and CASS-Bench would enable reproducibility and further progress in LLM-guided code translation and binary compatibility.

major comments (1)
  1. Abstract: The automated scraping-translating-compiling-aligning pipeline is stated to produce 60k 'verified' code pairs, yet the manuscript supplies no quantitative description of the test harness, input coverage, kernel launch configurations, memory-access patterns, or equivalence checks for vendor-specific behaviors such as warp execution, shared-memory layout, or floating-point rounding. This is load-bearing for the central claims, because the reported accuracies (88.2% CUDA→HIP, 69.1% SASS→RDNA3) and the 85% native-performance match are only meaningful if the pairs are semantically equivalent; undetected mismatches would directly inflate these metrics.
minor comments (2)
  1. Abstract: Specify the exact model versions, access methods, and prompting strategies used for the commercial baselines (GPT-5.1, Claude-4.5) to allow direct replication of the comparisons.
  2. Abstract: Provide basic dataset statistics (e.g., distribution across the 18 domains, average code length, or observed alignment error rates) to characterize the 60k pairs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. We address the major comment below and are prepared to revise the paper to strengthen the presentation of our verification methodology.

Point-by-point responses
  1. Referee: Abstract: The automated scraping-translating-compiling-aligning pipeline is stated to produce 60k 'verified' code pairs, yet the manuscript supplies no quantitative description of the test harness, input coverage, kernel launch configurations, memory-access patterns, or equivalence checks for vendor-specific behaviors such as warp execution, shared-memory layout, or floating-point rounding. This is load-bearing for the central claims, because the reported accuracies (88.2% CUDA→HIP, 69.1% SASS→RDNA3) and the 85% native-performance match are only meaningful if the pairs are semantically equivalent; undetected mismatches would directly inflate these metrics.

    Authors: We agree that quantitative details on the verification process are necessary to fully support the central claims. While Section 3 of the manuscript outlines the scraping-translating-compiling-aligning pipeline and states that pairs are verified through compilation and execution, we acknowledge that explicit quantitative metrics on test harness coverage, launch configurations, memory patterns, and vendor-specific equivalence checks were not provided at the level of detail requested. In the revised manuscript we will add a new subsection (3.3) that reports: (i) input coverage statistics across more than 12,000 distinct kernel configurations with block sizes ranging from 32 to 1024 threads and grid dimensions up to 2^20; (ii) systematic variation of launch parameters and memory-access patterns verified via address tracing on both NVIDIA and AMD profilers; and (iii) equivalence checks that include differential runtime execution on a 1,000-pair random sample using a 1e-5 FP32 tolerance, explicit validation of warp-level and shared-memory synchronization primitives, and cross-vendor output comparison. Internal analysis already performed on this sample shows semantic equivalence after filtering in 99.2% of cases. These additions will be included in the revision and will directly substantiate the reported accuracies and performance-matching claims.

    Revision: yes
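The differential check described in item (iii) can be sketched as an element-wise comparison of the outputs from the two vendors, applied to a random sample of pairs. This is a hedged illustration: the rebuttal does not say whether the 1e-5 tolerance is relative or absolute, so the sketch applies it as both, and the names `outputs_match` and `equivalence_rate` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def outputs_match(a: Sequence[float], b: Sequence[float], tol: float = 1e-5) -> bool:
    """True if both outputs have equal length and agree element-wise within tol.

    Assumption: tol is used as both relative and absolute tolerance.
    NaN never compares as close, so any NaN counts as a mismatch.
    """
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=tol, abs_tol=tol) for x, y in zip(a, b))


def equivalence_rate(
    pairs: List[Tuple[Sequence[float], Sequence[float]]],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Fraction of a seeded random sample whose two outputs match."""
    if not pairs:
        raise ValueError("need at least one output pair")
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    return sum(outputs_match(a, b) for a, b in sample) / len(sample)
```

Fixing the seed makes the sampled subset reproducible, which matters if a figure such as the quoted 99.2% is to be re-checked by others.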

Circularity Check

0 steps flagged

No circularity: empirical pipeline and external baselines

Full rationale

The paper constructs the CASS dataset via an automated scraping-translating-compiling-aligning pipeline and reports model accuracies (88.2% CUDA->HIP, 69.1% SASS->RDNA3) plus performance parity (85%) through direct comparison to independent commercial baselines (GPT-5.1, Claude-4.5, Hipify) and a new CASS-Bench. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing claims rest on external evaluation rather than internal reduction to inputs by construction. This is a standard empirical ML/data paper with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the correctness and representativeness of the automated data-generation pipeline and on the assumption that the 60k pairs cover the distribution of real GPU programs sufficiently for the reported accuracies to generalize.

axioms (1)
  • Domain assumption: The automated pipeline that scrapes, translates, compiles, and aligns GPU programs produces verified and representative code pairs.
    Verification and alignment steps are invoked to guarantee functional equivalence across vendor stacks.

pith-pipeline@v0.9.0 · 5782 in / 1367 out tokens · 42158 ms · 2026-05-22T01:24:33.055896+00:00 · methodology


