
arxiv: 2603.08721 · v2 · pith:CEE27KQ3new · submitted 2026-02-10 · 💻 cs.AR · cs.LG · cs.SE

KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

classification 💻 cs.AR · cs.LG · cs.SE
keywords kernels · emerging · hardware · kernelcraft · accelerators · across · isas · agent

New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels, a time-consuming and error-prone process that does not scale across hardware targets. This delays emerging hardware platforms from reaching the market. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark for evaluating an LLM agent's ability to generate and optimize low-level kernels for customized accelerators through a function-calling, feedback-driven workflow. We evaluate agent performance across three emerging accelerators on more than 20 machine-learning tasks, each with five diverse task configurations. Across four leading reasoning models, the strongest agents generate functionally correct kernels for unseen ISAs within a few refinement steps and produce optimized kernels that match or outperform compiler baselines. These results demonstrate KernelCraft's potential to accelerate the accelerator chip development cycle. KernelCraft is available at https://kernelcraft-cam.github.io/.
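The abstract describes a function-calling, feedback-driven workflow in which an agent refines a kernel until it is functionally correct. The following is a minimal sketch of that general pattern only, not KernelCraft's actual harness or API: `check_kernel`, `refine`, `scripted_agent`, and the ReLU task are all hypothetical names chosen for illustration, and a scripted proposer stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Kernel = Callable[[List[float]], List[float]]


@dataclass
class Feedback:
    correct: bool
    message: str


def check_kernel(candidate: Kernel, reference: Kernel,
                 inputs: List[List[float]]) -> Feedback:
    """Hypothetical 'tool call': run the candidate and compare to a reference."""
    for x in inputs:
        try:
            got = candidate(x)
        except Exception as exc:  # report failures back instead of crashing
            return Feedback(False, f"runtime error: {exc}")
        want = reference(x)
        if len(got) != len(want) or any(abs(g - w) > 1e-6 for g, w in zip(got, want)):
            return Feedback(False, f"mismatch on input {x}")
    return Feedback(True, "all tests passed")


def refine(propose: Callable[[Optional[Feedback]], Kernel], reference: Kernel,
           inputs: List[List[float]], max_steps: int = 5) -> Tuple[Optional[Kernel], int]:
    """Ask for a candidate, check it, and feed the result into the next proposal."""
    feedback: Optional[Feedback] = None
    for step in range(1, max_steps + 1):
        candidate = propose(feedback)
        feedback = check_kernel(candidate, reference, inputs)
        if feedback.correct:
            return candidate, step
    return None, max_steps


def relu_reference(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def scripted_agent() -> Callable[[Optional[Feedback]], Kernel]:
    """Stand-in for an LLM agent (assumption): a wrong attempt, then a correct one."""
    attempts = iter([
        lambda x: [abs(v) for v in x],       # wrong: absolute value, not ReLU
        lambda x: [max(0.0, v) for v in x],  # correct ReLU
    ])
    return lambda feedback: next(attempts)


kernel, steps = refine(scripted_agent(), relu_reference, [[-1.0, 2.0], [0.0, -3.5]])
print(steps)
```

In a real setting the proposer would read the feedback message and emit ISA-level code, and the checker would run on a simulator or hardware target; both details are outside what the abstract specifies.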



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FastKernels: Benchmarking GPU Kernel Generation in Production

    cs.LG 2026-05 conditional novelty 8.0

    FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.

  2. TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks

    cs.LG 2026-05 unverdicted novelty 7.0

    TriAxialKV introduces triaxial mixed-precision KV-cache quantization that matches BF16 accuracy at 4.5x cache size and 30% higher throughput for a Qwen3-VL agent on OSWorld.

  3. ATLAAS: Automatic Tensor-Level Abstraction of Accelerator Semantics

    cs.AR 2026-04 unverdicted novelty 7.0

    ATLAAS automatically converts RTL-extracted bit-level accelerator semantics into tensor-level ISA specs via an 8-pass MLIR pipeline, enabling automated compiler backend generation for designs like Gemmini and VTA.

  4. Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

    cs.LG 2026-05 unverdicted novelty 6.0

    Metal-Sci is a benchmark and harness for LLM evolutionary optimization of Apple Silicon Metal kernels that uses held-out sizes to detect silent regressions missed by in-distribution scores.

  5. MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

    cs.AR 2026-04 unverdicted novelty 6.0

    MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.