pith. machine review for the scientific record.

arxiv: 2502.10517 · v1 · submitted 2025-02-14 · 💻 cs.LG · cs.AI · cs.PF · cs.SE

Recognition: 2 theorem links · Lean Theorem

KernelBench: Can LLMs Write Efficient GPU Kernels?

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 16:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.PF · cs.SE
keywords LLM code generation · GPU kernels · benchmark · PyTorch · performance optimization · machine learning workloads · kernel speedup

The pith

Language models match PyTorch GPU kernel performance in fewer than 20 percent of cases

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

KernelBench introduces a benchmark of 250 PyTorch machine learning workloads to measure whether language models can generate kernels that are both correct and faster than standard implementations. The new fast_p metric tracks the percentage of generated kernels that remain functionally correct while exceeding a chosen speedup threshold p. Experiments across frontier models show that reasoning models perform best out of the box yet still succeed on fewer than 20 percent of tasks, with modest gains from iterative refinement driven by execution and profiling feedback. Success on this benchmark would directly reduce the time experts spend writing custom kernels to speed up real ML systems.

Core claim

KernelBench evaluates language models on writing efficient GPU kernels for 250 real PyTorch ML workloads in a setting that mirrors production engineering needs. Frontier reasoning models achieve the highest out-of-the-box success rates but still produce kernels that are correct and faster than the PyTorch baseline in under 20 percent of cases. Iterative refinement that incorporates runtime execution and profiling feedback improves results, yet the benchmark becomes substantially harder as the required speedup threshold p is raised.

What carries the argument

KernelBench, a suite of 250 PyTorch workloads together with the fast_p metric that reports the share of generated kernels which are functionally correct and exceed a speedup threshold p over baseline
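
A minimal sketch of the fast_p computation described above, assuming per-task records of correctness and wall-clock timings; the field and function names are illustrative, not KernelBench's actual API:

    from dataclasses import dataclass

    @dataclass
    class KernelResult:
        correct: bool        # passed functional-correctness checks
        baseline_ms: float   # PyTorch reference runtime
        generated_ms: float  # LLM-generated kernel runtime

    def fast_p(results: list[KernelResult], p: float) -> float:
        """Fraction of tasks whose kernel is correct and more than p times faster."""
        hits = sum(
            r.correct and (r.baseline_ms / r.generated_ms) > p
            for r in results
        )
        return hits / len(results)

Under this definition fast_0 counts every correct kernel regardless of speed, while fast_1 counts correct kernels that beat the PyTorch baseline outright; the headline result corresponds to fast_1 staying below 0.2.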

If this is right

  • Progress on KernelBench directly translates into faster practical kernels for machine learning systems
  • Iterative refinement that uses execution and profiling feedback raises the number of successful kernels
  • Raising the speedup threshold p increases the difficulty of the benchmark for all tested models
  • Frontier reasoning models achieve the best performance when generating kernels without extra techniques

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models may close more of the gap if trained on larger corpora of low-level GPU code
  • The benchmark could test whether new test-time search methods outperform simple iterative feedback
  • Wider use of such evaluation suites might reduce dependence on manual kernel tuning in ML development

Load-bearing premise

The 250 selected workloads are representative of the kernels that matter most in current and near-future ML systems

What would settle it

An LLM that produces functionally correct kernels offering at least a 1x speedup over PyTorch on more than 30 percent of the 250 workloads would contradict the reported overall shortfall
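
Stated in terms of the fast_p metric sketched above (a hedged restatement, not the paper's own notation):

    \mathrm{fast}_1 \;=\; \frac{1}{250} \sum_{i=1}^{250} \mathbf{1}\left[\, \mathrm{correct}_i \,\wedge\, \mathrm{speedup}_i > 1 \,\right] \;>\; 0.30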

read the original abstract

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces KernelBench, an open-source evaluation framework consisting of 250 carefully selected PyTorch ML workloads to measure LLMs' ability to generate functionally correct and high-performance GPU kernels. It defines the fast_p metric (percentage of kernels that are correct and exceed a tunable speedup threshold p over a PyTorch baseline) and reports that frontier reasoning models achieve the highest out-of-the-box success but still match the baseline in fewer than 20% of cases, with modest gains from iterative refinement that incorporates execution and profiling feedback.

Significance. If the workload suite is representative, the results establish a clear, reproducible baseline showing that current LLMs remain far from replacing expert kernel engineering for real ML systems. The open release of the benchmark, the fast_p metric, and the empirical comparison across multiple models and test-time strategies constitute a concrete contribution that can guide subsequent work on execution-aware code generation.

major comments (1)
  1. [Workload curation] Workload curation section: the claim that the 250 workloads are 'carefully selected' from PyTorch ML code and that progress on KernelBench 'directly translates to faster practical kernels' is not supported by any quantitative breakdown (operation-class distribution, model-family coverage, or comparison against production traces or MLPerf). Without this evidence the headline result (<20% baseline-matching rate) cannot be read as a general statement about LLM performance on kernels that matter in current systems.
minor comments (2)
  1. [Abstract and Evaluation] Abstract and §4: the description of functional-correctness verification for fast_p should explicitly state the test harness, numerical tolerance, and failure modes considered (a sketch of such a harness follows this list).
  2. [Results] Figure and table captions: ensure every speedup plot and table reports the exact value of p used and the number of samples per model.
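
A hedged sketch of the kind of correctness harness the first minor comment asks the authors to document: run the generated kernel against the PyTorch reference on fresh random inputs and compare within a numerical tolerance. The trial count and tolerance values are assumptions for illustration, not the paper's reported settings:

    import torch

    def is_correct(ref_fn, gen_fn, make_inputs, trials=5,
                   rtol=1e-4, atol=1e-4) -> bool:
        """Compare generated-kernel outputs to the reference within tolerance."""
        for _ in range(trials):
            inputs = make_inputs()
            expected = ref_fn(*inputs)
            try:
                actual = gen_fn(*inputs)
            except Exception:  # crashes and launch failures count as incorrect
                return False
            if not torch.allclose(expected, actual, rtol=rtol, atol=atol):
                return False
        return True

Pinning down these choices matters because the tolerance directly determines which approximate kernels count toward fast_p.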

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive review. We address the major comment on workload curation below and have updated the manuscript to include additional quantitative details on the benchmark suite.

read point-by-point responses
  1. Referee: [Workload curation] Workload curation section: the claim that the 250 workloads are 'carefully selected' from PyTorch ML code and that progress on KernelBench 'directly translates to faster practical kernels' is not supported by any quantitative breakdown (operation-class distribution, model-family coverage, or comparison against production traces or MLPerf). Without this evidence the headline result (<20% baseline-matching rate) cannot be read as a general statement about LLM performance on kernels that matter in current systems.

    Authors: We agree that a quantitative breakdown strengthens the claims. In the revised manuscript we have expanded the Workload Curation section with: (1) an operation-class distribution table (e.g., GEMM 38%, convolution 27%, elementwise 18%, reduction 12%, other 5%), (2) model-family coverage (ResNet/VGG 22%, Transformer 35%, diffusion 15%, other vision/language 28%), and (3) a brief comparison to MLPerf and public production traces showing substantial overlap in dominant operations. We have also revised the abstract and introduction to state that progress on KernelBench is expected to translate to practical kernels for workloads of similar structure rather than claiming universal applicability. These additions allow readers to assess representativeness directly. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark with external baseline comparison.

full rationale

The paper introduces KernelBench as an empirical evaluation suite of 250 PyTorch workloads, measuring LLM-generated kernels against a fixed external PyTorch baseline using the fast_p metric. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text. The central result (frontier models match baseline in <20% of cases) is a direct count from execution, not reduced by construction to any input definition or prior self-citation. Representativeness of workloads is an external-validity issue, not a circularity flaw in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces no free parameters, axioms, or invented entities; it relies on standard empirical evaluation practices against an external baseline.

pith-pipeline@v0.9.0 · 5508 in / 1035 out tokens · 34322 ms · 2026-05-15T16:50:29.363516+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  3. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  4. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  5. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  6. KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    cs.LG 2026-05 unverdicted novelty 7.0

    KernelBench-X benchmark shows task category predicts LLM kernel correctness better than method choice, iterative refinement trades performance for higher success rates, and correctness does not ensure efficiency gains...

  7. KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    cs.LG 2026-05 conditional novelty 7.0

    KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization rema...

  8. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...

  9. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automatically extracts isolated, reproducible GPU kernels from large HIP and Triton applications on AMD GPUs by capturing HSA dispatches and producing self-contained reproducer projects that preserve virtual-a...

  10. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.

  11. FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

    cs.DC 2026-04 unverdicted novelty 7.0

    FACT is a three-stage agent-driven system that synthesizes and composes CUTLASS kernels from PyTorch modules, achieving up to 2.03x speedup on transformer blocks over PyTorch and competing optimizers.

  12. Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon

    cs.LG 2026-04 unverdicted novelty 7.0

    Kernel Contracts is a specification language that formalizes correctness requirements for ML kernels to ensure consistent results across heterogeneous silicon platforms.

  13. SkillEvolver: Skill Learning as a Meta-Skill

    cs.AI 2026-05 unverdicted novelty 6.0

    A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.

  14. Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization

    cs.PF 2026-04 unverdicted novelty 6.0

    Optimas deploys a multi-agent LLM workflow to convert performance diagnostics into correct code transformations, delivering 100% valid code and performance gains in 98.82% of 3,410 experiments across benchmarks and HP...

  15. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  16. AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...

  17. MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

    cs.AR 2026-04 unverdicted novelty 6.0

    MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

  18. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  19. InCoder-32B-Thinking: Industrial Code World Model for Thinking

    cs.AR 2026-04 unverdicted novelty 6.0

    InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

  20. Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

    cs.CL 2026-03 unverdicted novelty 6.0

    Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...

  21. Benchmarking Compound AI Applications for Hardware-Software Co-Design

    cs.DC 2026-03 unverdicted novelty 6.0

    Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 19 Pith papers · 6 internal anchors

  1. [1]

    Apple ml compute framework (mlx), 2020

    Apple. Apple ml compute framework (mlx), 2020. URL https://developer.apple.com/metal/

  2. [2]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. International Conference on Machine Learning, 2024

  3. [3]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787

  4. [4]

    Cerebras wafer-scale engine wse architecture

    Cerebras. Cerebras wafer-scale engine wse architecture. Online. https://cerebras.ai/product-chip/

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  6. [6]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. International Conference on Learning Representations, 2024

  7. [7]

    Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. International Conference on Machine Learning (ICML), 2024

  8. [8]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  9. [9]

    Deepseek-v3 technical report, 2025

    DeepSeek-AI. Deepseek-v3 technical report, 2025. URL https://github.com/deepseek-ai/DeepSeek-V3

  10. [10]

    Graphcore IPU architecture

    Graphcore. Graphcore IPU architecture. Online. https://www.graphcore.ai/products/ipu

  11. [11]

    Groq architecture

    Groq. Groq architecture. Online. https://groq.com/

  12. [12]

    Priority sampling of large language models for compilers, 2024

    Dejan Grubisic, Chris Cummins, Volker Seeker, and Hugh Leather. Priority sampling of large language models for compilers, 2024. URL https://arxiv.org/abs/2402.18734

  13. [13]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus), 2023. URL https://arxiv.org/abs/1606.08415

  14. [14]

    Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

    Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023. URL https://arxiv.org/abs/2304.01433

  15. [15]

    Flashattention minimal

    Peter Kim. Flashattention minimal. Online, 2024. https://github.com/tspeterkim/flash-attention-minimal

  16. [16]

    The stack: 3 tb of permissively licensed source code, 2022

    Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The stack: 3 tb of permissively licensed source code, 2022. URL https://arxiv.org/abs/2211.15533

  17. [17]

    Ds-1000: A natural and reliable benchmark for data science code generation, 2022

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation, 2022. URL https://arxiv.org/abs/2211.11501

  18. [18]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Lo...

  19. [19]

    Competition-level code generation with AlphaCode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, P...

  20. [20]

    Cuda mode notes - lecture 004

    Christian J. Mills. Cuda mode notes - lecture 004. Online, 2024. https://christianjmills.com/posts/cuda-mode-notes/lecture-004/

  21. [21]

    Performance-aligned llms for generating fast code, 2024

    Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele. Performance-aligned llms for generating fast code, 2024. URL https://arxiv.org/abs/2404.18864

  22. [22]

    cudnn: Gpu-accelerated library for deep neural networks, 2014

    NVIDIA. cudnn: Gpu-accelerated library for deep neural networks, 2014. URL https://developer.nvidia.com/cudnn

  23. [23]

    Cuda templates for linear algebra subroutines, 2017

    NVIDIA. Cuda templates for linear algebra subroutines, 2017. URL https://github.com/NVIDIA/cutlass

  24. [24]

    Nvidia Tesla V100 GPU architecture, 2017

    NVIDIA. Nvidia Tesla V100 GPU architecture, 2017

  25. [25]

    Nvidia A100 tensor core GPU architecture, 2020

    NVIDIA. Nvidia A100 tensor core GPU architecture, 2020

  26. [26]

    Nvidia H100 tensor core GPU architecture, 2022

    NVIDIA. Nvidia H100 tensor core GPU architecture, 2022

  27. [27]

    cuBLAS, 2023

    NVIDIA. cuBLAS, 2023. URL https://docs.nvidia.com/cuda/cublas/

  28. [28]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-perfor...

  29. [29]

    Rwkv: Reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, et al. Rwkv: Reinventing rnns for the transformer era. Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

  30. [30]

    FlashAttention-3: Fast and accurate attention with asynchrony and low-precision, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608

  31. [31]

    Can language models solve olympiad programming?, 2024

    Quan Shi, Michael Tang, Karthik Narasimhan, and Shunyu Yao. Can language models solve olympiad programming?, 2024. URL https://arxiv.org/abs/2404.10952

  32. [32]

    Thunderkittens: Simple, fast, and adorable ai kernels

    Benjamin Spector, Simran Arora, Aaryan Singhal, Daniel Fu, and Christopher Ré. Thunderkittens: Simple, fast, and adorable ai kernels. International Conference on Learning Representations (ICLR), 2024

  33. [33]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022

  34. [34]

    FlexAttention: The flexibility of PyTorch with the performance of FlashAttention, 2024

    Team PyTorch, Horace He, Driss Guessous, Yanbo Liang, and Joy Dong. FlexAttention: The flexibility of PyTorch with the performance of FlashAttention, 2024. URL https://pytorch.org/blog/flexattention/

  35. [35]

    Coderosetta: Pushing the boundaries of unsupervised code translation for parallel programming

    Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, and Ali Jannesari. Coderosetta: Pushing the boundaries of unsupervised code translation for parallel programming. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. URL https://openreview.net/forum?id=V6hrg4O9gg

  36. [36]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019

  37. [37]

    On computable numbers, with an application to the Entscheidungsproblem

    Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(42):230–265, 1936. URL http://www.cs.helsinki.fi/u/gionis/cc05/OnComputableNumbers.pdf

  38. [38]

    Comparing llama-2 and gpt-3 llms for hpc kernels generation

    Pedro Valero-Lara, Alexis Huante, Mustafa Al Lail, William F. Godoy, Keita Teranishi, Prasanna Balaprakash, and Jeffrey S. Vetter. Comparing llama-2 and gpt-3 llms for hpc kernels generation, 2023. URL https://arxiv.org/abs/2309.07103

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017

  40. [40]

    ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness?

    Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, and Daniel Fried. ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15362–15376, Miami, Florida, USA,...

  41. [41]

    BabelTower: Learning to auto-parallelized program translation

    Yuanbo Wen, Qi Guo, Qiang Fu, Xiaqing Li, Jianxing Xu, Yanlin Tang, Yongwei Zhao, Xing Hu, Zidong Du, Ling Li, Chao Wang, Xuehai Zhou, and Yunji Chen. BabelTower: Learning to auto-parallelized program translation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Co...

  42. [42]

    Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts, 2024

    Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, and Elizabeth Barnes. Re-bench: Evaluating frontier ai r&d c...

  43. [43]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv:2405.15793, 2024

  44. [44]

    Swe-bench multimodal: Do ai systems generalize to visual software domains?

    John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024. URL https://arxiv.org/abs/2410.03859

  45. [45]

    Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024

    Songlin Yang and Yu Zhang. Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/sustcsonglin/flash-linear-attention

  46. [46]

    " " 6 Simple model that p erf or ms a single matrix m u l t i p l i c a t i o n ( C = A * B ) with a large K ,→ d i m e n s i o n 7

    Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, and Charles Sutton. Natural language to code generation in interactive data science notebooks, 2022. URL https://arxiv.org/abs/2212. 09248. 14 A KernelBench Task Example Here we provide an example t...