pith. machine review for the scientific record.

arxiv: 2502.10517 · v1 · submitted 2025-02-14 · 💻 cs.LG · cs.AI · cs.PF · cs.SE

Recognition: 2 theorem links · Lean Theorem

KernelBench: Can LLMs Write Efficient GPU Kernels?

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 16:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.PF · cs.SE
keywords LLM code generation · GPU kernels · benchmark · PyTorch · performance optimization · machine learning workloads · kernel speedup

The pith

Language models match PyTorch GPU kernel performance in fewer than 20 percent of cases

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

KernelBench introduces a benchmark of 250 PyTorch machine learning workloads to measure whether language models can generate kernels that are both correct and faster than standard implementations. The new fast_p metric tracks the percentage of generated kernels that remain functionally correct while exceeding a chosen speedup threshold p. Experiments across frontier models show that reasoning models perform best out of the box yet still succeed on fewer than 20 percent of tasks, with modest gains from iterative refinement driven by execution and profiling feedback. Success on this benchmark would directly reduce the time experts spend writing custom kernels to speed up real ML systems.

Core claim

KernelBench evaluates language models on writing efficient GPU kernels for 250 real PyTorch ML workloads in a setting that mirrors production engineering needs. Frontier reasoning models achieve the highest out-of-the-box success rates but still produce kernels that are correct and faster than the PyTorch baseline in under 20 percent of cases. Iterative refinement that incorporates runtime execution and profiling feedback improves results, yet the benchmark becomes substantially harder as the required speedup threshold p is raised.

What carries the argument

KernelBench, a suite of 250 PyTorch workloads together with the fast_p metric that reports the share of generated kernels which are functionally correct and exceed a speedup threshold p over baseline
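
A minimal sketch of the fast_p computation described above, assuming per-task records of correctness and wall-clock timings; the field and function names are illustrative, not KernelBench's actual API:

    from dataclasses import dataclass

    @dataclass
    class KernelResult:
        correct: bool        # passed functional-correctness checks
        baseline_ms: float   # PyTorch reference runtime
        generated_ms: float  # LLM-generated kernel runtime

    def fast_p(results: list[KernelResult], p: float) -> float:
        """Fraction of tasks whose kernel is correct and more than p times faster."""
        hits = sum(
            r.correct and (r.baseline_ms / r.generated_ms) > p
            for r in results
        )
        return hits / len(results)

Under this definition fast_0 counts every correct kernel regardless of speed, while fast_1 counts correct kernels that beat the PyTorch baseline outright; the headline result corresponds to fast_1 staying below 0.2.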

If this is right

  • Progress on KernelBench directly translates into faster practical kernels for machine learning systems
  • Iterative refinement that uses execution and profiling feedback raises the number of successful kernels
  • Raising the speedup threshold p increases the difficulty of the benchmark for all tested models
  • Frontier reasoning models achieve the best performance when generating kernels without extra techniques

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models may close more of the gap if trained on larger corpora of low-level GPU code
  • The benchmark could test whether new test-time search methods outperform simple iterative feedback
  • Wider use of such evaluation suites might reduce dependence on manual kernel tuning in ML development

Load-bearing premise

The 250 selected workloads are representative of the kernels that matter most in current and near-future ML systems

What would settle it

An LLM that produces functionally correct kernels offering at least a 1x speedup over PyTorch on more than 30 percent of the 250 workloads would contradict the reported overall shortfall
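
Stated in terms of the fast_p metric sketched above (a hedged restatement, not the paper's own notation):

    \mathrm{fast}_1 \;=\; \frac{1}{250} \sum_{i=1}^{250} \mathbf{1}\left[\, \mathrm{correct}_i \,\wedge\, \mathrm{speedup}_i > 1 \,\right] \;>\; 0.30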

read the original abstract

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces KernelBench, an open-source evaluation framework consisting of 250 carefully selected PyTorch ML workloads to measure LLMs' ability to generate functionally correct and high-performance GPU kernels. It defines the fast_p metric (percentage of kernels that are correct and exceed a tunable speedup threshold p over a PyTorch baseline) and reports that frontier reasoning models achieve the highest out-of-the-box success but still match the baseline in fewer than 20% of cases, with modest gains from iterative refinement that incorporates execution and profiling feedback.

Significance. If the workload suite is representative, the results establish a clear, reproducible baseline showing that current LLMs remain far from replacing expert kernel engineering for real ML systems. The open release of the benchmark, the fast_p metric, and the empirical comparison across multiple models and test-time strategies constitute a concrete contribution that can guide subsequent work on execution-aware code generation.

major comments (1)
  1. [Workload curation] Workload curation section: the claim that the 250 workloads are 'carefully selected' from PyTorch ML code and that progress on KernelBench 'directly translates to faster practical kernels' is not supported by any quantitative breakdown (operation-class distribution, model-family coverage, or comparison against production traces or MLPerf). Without this evidence the headline result (<20% baseline-matching rate) cannot be read as a general statement about LLM performance on kernels that matter in current systems.
minor comments (2)
  1. [Abstract and Evaluation] Abstract and §4: the description of functional-correctness verification for fast_p should explicitly state the test harness, numerical tolerance, and failure modes considered (a sketch of such a harness follows this list).
  2. [Results] Figure and table captions: ensure every speedup plot and table reports the exact value of p used and the number of samples per model.
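
A hedged sketch of the kind of correctness harness the first minor comment asks the authors to document: run the generated kernel against the PyTorch reference on fresh random inputs and compare within a numerical tolerance. The trial count and tolerance values are assumptions for illustration, not the paper's reported settings:

    import torch

    def is_correct(ref_fn, gen_fn, make_inputs, trials=5,
                   rtol=1e-4, atol=1e-4) -> bool:
        """Compare generated-kernel outputs to the reference within tolerance."""
        for _ in range(trials):
            inputs = make_inputs()
            expected = ref_fn(*inputs)
            try:
                actual = gen_fn(*inputs)
            except Exception:  # crashes and launch failures count as incorrect
                return False
            if not torch.allclose(expected, actual, rtol=rtol, atol=atol):
                return False
        return True

Pinning down these choices matters because the tolerance directly determines which approximate kernels count toward fast_p.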

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the constructive review. We address the major comment on workload curation below and have updated the manuscript to include additional quantitative details on the benchmark suite.

read point-by-point responses
  1. Referee: [Workload curation] Workload curation section: the claim that the 250 workloads are 'carefully selected' from PyTorch ML code and that progress on KernelBench 'directly translates to faster practical kernels' is not supported by any quantitative breakdown (operation-class distribution, model-family coverage, or comparison against production traces or MLPerf). Without this evidence the headline result (<20% baseline-matching rate) cannot be read as a general statement about LLM performance on kernels that matter in current systems.

    Authors: We agree that a quantitative breakdown strengthens the claims. In the revised manuscript we have expanded the Workload Curation section with: (1) an operation-class distribution table (e.g., GEMM 38%, convolution 27%, elementwise 18%, reduction 12%, other 5%), (2) model-family coverage (ResNet/VGG 22%, Transformer 35%, diffusion 15%, other vision/language 28%), and (3) a brief comparison to MLPerf and public production traces showing substantial overlap in dominant operations. We have also revised the abstract and introduction to state that progress on KernelBench is expected to translate to practical kernels for workloads of similar structure rather than claiming universal applicability. These additions allow readers to assess representativeness directly. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark with external baseline comparison.

full rationale

The paper introduces KernelBench as an empirical evaluation suite of 250 PyTorch workloads, measuring LLM-generated kernels against a fixed external PyTorch baseline using the fast_p metric. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text. The central result (frontier models match baseline in <20% of cases) is a direct count from execution, not reduced by construction to any input definition or prior self-citation. Representativeness of workloads is an external-validity issue, not a circularity flaw in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces no free parameters, axioms, or invented entities; it relies on standard empirical evaluation practices against an external baseline.

pith-pipeline@v0.9.0 · 5508 in / 1035 out tokens · 34322 ms · 2026-05-15T16:50:29.363516+00:00 · methodology


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

    cs.AI 2026-05 unverdicted novelty 8.0

    VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...

  2. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  3. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  4. CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.

  5. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  6. KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    cs.LG 2026-05 unverdicted novelty 7.0

    KernelBench-X benchmark shows task category predicts LLM kernel correctness better than method choice, iterative refinement trades performance for higher success rates, and correctness does not ensure efficiency gains...

  7. KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

    cs.LG 2026-05 conditional novelty 7.0

    KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization rema...

  8. ProgramBench: Can Language Models Rebuild Programs From Scratch?

    cs.SE 2026-05 unverdicted novelty 7.0

    ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...

  9. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automatically extracts isolated, reproducible GPU kernels from large HIP and Triton applications on AMD GPUs by capturing HSA dispatches and producing self-contained reproducer projects that preserve virtual-a...

  10. Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

    cs.SE 2026-05 conditional novelty 7.0

    Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.

  11. FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

    cs.DC 2026-04 unverdicted novelty 7.0

    FACT is a three-stage agent-driven system that synthesizes and composes CUTLASS kernels from PyTorch modules, achieving up to 2.03x speedup on transformer blocks over PyTorch and competing optimizers.

  12. Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon

    cs.LG 2026-04 unverdicted novelty 7.0

    Kernel Contracts is a specification language that formalizes correctness requirements for ML kernels to ensure consistent results across heterogeneous silicon platforms.

  13. SkillEvolver: Skill Learning as a Meta-Skill

    cs.AI 2026-05 unverdicted novelty 6.0

    A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.

  14. Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization

    cs.PF 2026-04 unverdicted novelty 6.0

    Optimas deploys a multi-agent LLM workflow to convert performance diagnostics into correct code transformations, delivering 100% valid code and performance gains in 98.82% of 3,410 experiments across benchmarks and HP...

  15. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  16. AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...

  17. MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

    cs.AR 2026-04 unverdicted novelty 6.0

    MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

  18. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  19. InCoder-32B-Thinking: Industrial Code World Model for Thinking

    cs.AR 2026-04 unverdicted novelty 6.0

    InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

  20. Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

    cs.CL 2026-03 unverdicted novelty 6.0

    Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...

  21. Benchmarking Compound AI Applications for Hardware-Software Co-Design

    cs.DC 2026-03 unverdicted novelty 6.0

    Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 19 Pith papers · 6 internal anchors

  1. [1]

    Apple ml compute framework (mlx), 2020

    Apple. Apple ml compute framework (mlx), 2020. URL https://developer.apple.com/metal/

  2. [2]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. International Conference on Machine Learning, 2024

  3. [3]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787

  4. [4]

    Cerebras wafer-scale engine wse architecture

    Cerebras. Cerebras wafer-scale engine wse architecture. Online. https://cerebras.ai/product-chip/

  5. [5]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  6. [6]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. International Conference on Learning Representations, 2024

  7. [7]

    Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. International Conference on Machine Learning (ICML), 2024

  8. [8]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  9. [9]

    Deepseek-v3 technical report, 2025

    DeepSeek-AI. Deepseek-v3 technical report, 2025. URL https://github.com/deepseek-ai/DeepSeek-V3

  10. [10]

    Graphcore IPU architecture

    Graphcore. Graphcore IPU architecture. Online. https://www.graphcore.ai/products/ipu

  11. [11]

    Groq architecture

    Groq. Groq architecture. Online. https://groq.com/

  12. [12]

    Priority sampling of large language models for compilers, 2024

    Dejan Grubisic, Chris Cummins, Volker Seeker, and Hugh Leather. Priority sampling of large language models for compilers, 2024. URL https://arxiv.org/abs/2402.18734

  13. [13]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus), 2023. URL https://arxiv.org/abs/1606.08415

  14. [14]

    Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

    Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023. URL https://arxiv.org/abs/2304.01433

  15. [15]

    Flashattention minimal

    Peter Kim. Flashattention minimal. Online, 2024. https://github.com/tspeterkim/flash-attention-minimal

  16. [16]

    The stack: 3 tb of permissively licensed source code, 2022

    Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The stack: 3 tb of permissively licensed source code, 2022. URL https://arxiv.org/abs/2211.15533

  17. [17]

    Ds-1000: A natural and reliable benchmark for data science code generation, 2022

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation, 2022. URL https://arxiv.org/abs/2211.11501

  18. [18]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Lo...

  19. [19]

    Competition-level code generation with AlphaCode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, P...

  20. [20]

    Cuda mode notes - lecture 004

    Christian J. Mills. Cuda mode notes - lecture 004. Online, 2024. https://christianjmills.com/posts/cuda-mode-notes/lecture-004/

  21. [21]

    Performance-aligned llms for generating fast code, 2024

    Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele. Performance-aligned llms for generating fast code, 2024. URL https://arxiv.org/abs/2404.18864

  22. [22]

    cudnn: Gpu-accelerated library for deep neural networks, 2014

    NVIDIA. cudnn: Gpu-accelerated library for deep neural networks, 2014. URL https://developer.nvidia.com/cudnn

  23. [23]

    Cuda templates for linear algebra subroutines, 2017

    NVIDIA. Cuda templates for linear algebra subroutines, 2017. URL https://github.com/NVIDIA/cutlass

  24. [24]

    Nvidia Tesla V100 GPU architecture, 2017

    NVIDIA. Nvidia Tesla V100 GPU architecture, 2017

  25. [25]

    Nvidia A100 tensor core GPU architecture, 2020

    NVIDIA. Nvidia A100 tensor core GPU architecture, 2020

  26. [26]

    Nvidia H100 tensor core GPU architecture, 2022

    NVIDIA. Nvidia H100 tensor core GPU architecture, 2022

  27. [27]

    cuBLAS, 2023

    NVIDIA. cuBLAS, 2023. URL https://docs.nvidia.com/cuda/cublas/

  28. [28]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-perfor...

  29. [29]

    Rwkv: Reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, et al. Rwkv: Reinventing rnns for the transformer era. Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

  30. [30]

    FlashAttention-3: Fast and accurate attention with asynchrony and low-precision, 2024

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608

  31. [31]

    Can language models solve olympiad programming?, 2024

    Quan Shi, Michael Tang, Karthik Narasimhan, and Shunyu Yao. Can language models solve olympiad programming?, 2024. URL https://arxiv.org/abs/2404.10952

  32. [32]

    Thunderkittens: Simple, fast, and adorable ai kernels

    Benjamin Spector, Simran Arora, Aaryan Singhal, Daniel Fu, and Christopher Ré. Thunderkittens: Simple, fast, and adorable ai kernels. International Conference on Learning Representations (ICLR), 2024

  33. [33]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022

  34. [34]

    FlexAttention: The flexibility of PyTorch with the performance of FlashAttention, 2024

    Team PyTorch, Horace He, Driss Guessous, Yanbo Liang, and Joy Dong. FlexAttention: The flexibility of PyTorch with the performance of FlashAttention, 2024. URL https://pytorch.org/blog/flexattention/

  35. [35]

    Coderosetta: Pushing the boundaries of unsupervised code translation for parallel programming

    Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, and Ali Jannesari. Coderosetta: Pushing the boundaries of unsupervised code translation for parallel programming. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. URL https://openreview.net/forum?id=V6hrg4O9gg

  36. [36]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019

  37. [37]

    On computable numbers, with an application to the Entscheidungsproblem

    Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(42):230–265, 1936. URL http://www.cs.helsinki.fi/u/gionis/cc05/OnComputableNumbers.pdf

  38. [38]

    Comparing llama-2 and gpt-3 llms for hpc kernels generation

    Pedro Valero-Lara, Alexis Huante, Mustafa Al Lail, William F. Godoy, Keita Teranishi, Prasanna Balaprakash, and Jeffrey S. Vetter. Comparing llama-2 and gpt-3 llms for hpc kernels generation, 2023. URL https://arxiv.org/abs/2309.07103

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017

  40. [40]

    ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness?

    Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, and Daniel Fried. ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15362–15376, Miami, Florida, USA,...

  41. [41]

    BabelTower: Learning to auto-parallelized program translation

    Yuanbo Wen, Qi Guo, Qiang Fu, Xiaqing Li, Jianxing Xu, Yanlin Tang, Yongwei Zhao, Xing Hu, Zidong Du, Ling Li, Chao Wang, Xuehai Zhou, and Yunji Chen. BabelTower: Learning to auto-parallelized program translation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Co...

  42. [42]

    Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts, 2024

    Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, and Elizabeth Barnes. Re-bench: Evaluating frontier ai r&d c...

  43. [43]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv:2405.15793, 2024

  44. [44]

    Swe-bench multimodal: Do ai systems generalize to visual software domains?

    John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024. URL https://arxiv.org/abs/2410.03859

  45. [45]

    Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024

    Songlin Yang and Yu Zhang. Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/sustcsonglin/flash-linear-attention

  46. [46]

    " " 6 Simple model that p erf or ms a single matrix m u l t i p l i c a t i o n ( C = A * B ) with a large K ,→ d i m e n s i o n 7

    Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, and Charles Sutton. Natural language to code generation in interactive data science notebooks, 2022. URL https://arxiv.org/abs/2212. 09248. 14 A KernelBench Task Example Here we provide an example t...