KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

Binglei Lou; Cheng Zhang; Erwei Wang; Haoran Wu; Jianyi Cheng; Jiayi Nie; Rika Antonova; Robert Mullins; Timothy M. Jones; Yao Lai

arxiv: 2603.08721 · v2 · pith:CEE27KQ3new · submitted 2026-02-10 · 💻 cs.AR · cs.LG · cs.SE

Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren Zhao

classification 💻 cs.AR cs.LG cs.SE

keywords kernels emerging hardware kernelcraft accelerators across isas agent

New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels, a time-consuming and error-prone process that does not scale across hardware targets. This delays emerging hardware platforms from reaching the market. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark for evaluating an LLM agent's ability to generate and optimize low-level kernels for customized accelerators through a function-calling, feedback-driven workflow. We evaluate agent performance across three emerging accelerators on more than 20 machine-learning tasks, each with five diverse task configurations. Across four leading reasoning models, the strongest agents generate functionally correct kernels for unseen ISAs within a few refinement steps and produce optimized kernels that match or outperform compiler baselines. These results demonstrate KernelCraft's potential to accelerate the accelerator chip development cycle. KernelCraft is available at https://kernelcraft-cam.github.io/.

This paper has not been read by Pith yet.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FastKernels: Benchmarking GPU Kernel Generation in Production
cs.LG 2026-05 conditional novelty 8.0

FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.

TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks
cs.LG 2026-05 unverdicted novelty 7.0

TriAxialKV introduces triaxial mixed-precision KV-cache quantization that matches BF16 accuracy at 4.5x cache size and 30% higher throughput for a Qwen3-VL agent on OSWorld.

ATLAAS: Automatic Tensor-Level Abstraction of Accelerator Semantics
cs.AR 2026-04 unverdicted novelty 7.0

ATLAAS automatically converts RTL-extracted bit-level accelerator semantics into tensor-level ISA specs via an 8-pass MLIR pipeline, enabling automated compiler backend generation for designs like Gemmini and VTA.

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
cs.LG 2026-05 unverdicted novelty 6.0

Metal-Sci is a benchmark and harness for LLM evolutionary optimization of Apple Silicon Metal kernels that uses held-out sizes to detect silent regressions missed by in-distribution scores.

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
cs.AR 2026-04 unverdicted novelty 6.0

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.