PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
Pith reviewed 2026-05-19 17:58 UTC · model grok-4.3
pith:AARY37WB
The pith
Current LLMs produce code that is functionally correct but far from expert-optimized on system-level performance tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PerfCodeBench consists of tasks that demand system-level implementation choices, hardware-aware optimizations, and handling of performance bottlenecks. When evaluated on a broad set of state-of-the-art LLMs, the generated code exhibits a clear gap relative to expert-optimized reference solutions, with the largest shortfalls appearing on parallelism and GPU tasks. Models also show inconsistent cross-language behavior and rarely match expert efficiency levels.
What carries the argument
PerfCodeBench, the executable benchmark that pairs each task with correctness verification, a baseline implementation, and a reference optimized solution to quantify runtime efficiency.
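The task structure described above (correctness check, baseline, reference optimized solution) can be sketched as a small harness. This is a minimal illustration only, not the benchmark's actual infrastructure: the function names, the toy sum-of-squares task, the median-of-repeats timing policy, and the metric names are all assumptions introduced here.

```python
import time
from statistics import median


def median_runtime(fn, data, repeats=5):
    """Median wall-clock seconds over `repeats` calls (hypothetical timing policy)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        samples.append(time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return max(median(samples), 1e-9)


def evaluate(candidate, baseline, reference, data, repeats=5):
    """Gate on correctness first, then compare runtimes with baseline and reference."""
    expected = reference(data)
    if candidate(data) != expected:
        return {"correct": False}
    t_base = median_runtime(baseline, data, repeats)
    t_ref = median_runtime(reference, data, repeats)
    t_cand = median_runtime(candidate, data, repeats)
    return {
        "correct": True,
        "speedup_vs_baseline": t_base / t_cand,
        "ref_speedup_vs_baseline": t_base / t_ref,
        # 1.0 means the candidate is as fast as the reference solution.
        "fraction_of_reference": t_ref / t_cand,
    }


# Toy task (hypothetical): sum of squares of a list of integers.
def baseline_sum_squares(xs):
    total = 0
    for x in xs:
        total += x * x
    return total


def reference_sum_squares(xs):
    return sum(x * x for x in xs)


def wrong_sum_squares(xs):
    return sum(xs)  # fast but incorrect, so it must be rejected
```

Gating on correctness before timing keeps a fast but wrong candidate from receiving any efficiency score, which is the reason each task carries an executable check alongside the two reference points.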
If this is right
- LLMs must improve specifically on parallelism and GPU operations to close the efficiency gap.
- Code-generation evaluation should incorporate runtime performance metrics in addition to functional correctness.
- Cross-language robustness remains a clear weakness that limits practical deployment.
- Performance-aware benchmarks are required to steer future model development toward efficient systems software.
Where Pith is reading between the lines
- The benchmark could be used to create targeted fine-tuning datasets focused on optimization reasoning.
- Integrating hardware simulation feedback into model training might narrow the observed gaps over time.
- Similar task designs could be applied to other hardware platforms to test broader claims about model limitations.
Load-bearing premise
The selected tasks accurately capture the realistic system-level choices, hardware-aware optimizations, and performance bottlenecks that matter in practice.
What would settle it
An LLM that consistently produces code matching or beating the reference optimized solutions on the benchmark tasks across multiple runs would falsify the claimed performance gap.
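One way to make this criterion concrete is a per-task check over repeated runs. The sketch below is an assumption-laden illustration: the 5% tolerance, the dictionary layout, and the function names are hypothetical and do not come from the paper.

```python
def matches_reference(run_times, reference_time, tolerance=0.05):
    """True if every run is at most (1 + tolerance) times the reference runtime."""
    return all(t <= reference_time * (1.0 + tolerance) for t in run_times)


def gap_is_falsified(results, tolerance=0.05):
    """True only if the model matches or beats the reference on every task, in every run.

    `results` maps a task name to {"model_runs": [seconds, ...], "reference": seconds}.
    An empty result set is treated as no evidence, so it returns False.
    """
    return bool(results) and all(
        matches_reference(r["model_runs"], r["reference"], tolerance)
        for r in results.values()
    )
```

Requiring every run on every task to pass is the strictest reading of "consistently"; a looser rule, such as a fixed fraction of runs, would be a different and equally hypothetical choice.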
Original abstract
Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation are still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLMs-generated code at https://anonymous.4open.science/r/perfcodebench-7CDE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. Tasks require system-level implementation choices, hardware-aware optimizations, and handling of performance bottlenecks; each includes executable correctness checks, a baseline implementation, and a reference optimized solution. Evaluation across state-of-the-art LLMs reports a clear gap versus expert-optimized code, largest on parallelism and GPU tasks, plus weaknesses in cross-language robustness and consistent efficiency.
Significance. If the tasks prove representative of real production bottlenecks, the benchmark would usefully demonstrate that current LLMs remain limited for performance-critical systems code and motivate performance-aware evaluation beyond functional correctness. Public release of the benchmark data, evaluation infrastructure, and complete LLM-generated code logs is a clear strength that supports reproducibility and follow-on work.
major comments (2)
- [Abstract] Abstract: the central claim of a 'clear gap' that is 'especially large on tasks involving parallelism and GPU operations' depends on the tasks accurately encoding realistic implementation decisions and bottlenecks. No information is supplied on task provenance (real kernels vs. synthetic), the fraction of tasks targeting GPU/parallelism, the magnitude of reference-vs-baseline speedups, or any external validation that the chosen bottlenecks are representative rather than selected for effect.
- [Evaluation] Evaluation (throughout): the reported gaps lack visible statistical controls for LLM output variability (e.g., multiple samples per prompt, temperature settings, or confidence intervals on runtime metrics). Without these, selection effects cannot be ruled out and the headline performance gap remains difficult to interpret.
minor comments (2)
- [Abstract] Abstract: 'performance-aware evaluation are still needed' is grammatically incorrect; should read 'performance-aware evaluations are still needed' or 'performance-aware evaluation is still needed'.
- [Abstract] The anonymous repository link is provided but should be replaced with a permanent archive (e.g., Zenodo) before publication to ensure long-term accessibility of the logs and infrastructure.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify how to better support the paper's claims. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim of a 'clear gap' that is 'especially large on tasks involving parallelism and GPU operations' depends on the tasks accurately encoding realistic implementation decisions and bottlenecks. No information is supplied on task provenance (real kernels vs. synthetic), the fraction of tasks targeting GPU/parallelism, the magnitude of reference-vs-baseline speedups, or any external validation that the chosen bottlenecks are representative rather than selected for effect.
  Authors: We agree that additional details on task construction are required to substantiate the central claims. In the revised manuscript we will add a dedicated subsection on benchmark construction that describes task provenance (drawn from documented performance-critical kernels in the HPC literature), the fraction of tasks involving parallelism and GPU operations, the observed reference-versus-baseline speedups, and the selection rationale based on commonly reported systems bottlenecks.
  Revision: yes
- Referee: [Evaluation] Evaluation (throughout): the reported gaps lack visible statistical controls for LLM output variability (e.g., multiple samples per prompt, temperature settings, or confidence intervals on runtime metrics). Without these, selection effects cannot be ruled out and the headline performance gap remains difficult to interpret.
  Authors: We acknowledge the need for statistical controls. The revised evaluation section will report results aggregated over multiple samples per prompt, specify the temperature and sampling settings, and include standard deviations together with confidence intervals on the runtime metrics.
  Revision: yes
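The statistical controls discussed in this exchange (multiple samples per prompt, standard deviations, confidence intervals) could be reported along the following lines. This is a sketch under stated assumptions: the percentile bootstrap is one common choice and is not confirmed as the authors' method, and the per-sample metric, function names, and example values are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def summarize(per_sample_scores):
    """Mean, sample standard deviation, and 95% CI over samples for one prompt."""
    lo, hi = bootstrap_ci(per_sample_scores)
    return {
        "mean": mean(per_sample_scores),
        "std": stdev(per_sample_scores),  # needs at least two samples
        "ci95": (lo, hi),
    }


# Hypothetical scores: fraction of reference efficiency for five samples of one prompt.
example_scores = [0.42, 0.55, 0.38, 0.61, 0.47]
summary = summarize(example_scores)
```

Reporting the interval next to the mean shows whether a gap between two models, or between a model and the reference, is larger than the sampling noise of the model itself.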
Circularity Check
No circularity: benchmark tasks and metrics are independently defined with public references.
full rationale
The paper introduces PerfCodeBench as a new executable benchmark containing baseline implementations, reference optimized solutions, and correctness checks for each task. Performance gaps are measured directly by comparing LLM outputs against these external references on runtime efficiency, with no equations, fitted parameters, or derivations that reduce the reported gaps to self-defined quantities. No self-citations are invoked to justify task selection, uniqueness of the metric, or the emphasis on parallelism/GPU tasks. The claims rest on the provided task set and submitted public logs rather than any self-referential construction or renaming of prior results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: executable correctness checks and runtime comparisons to reference solutions can reliably measure optimization quality.
Appendix A: Data Sources
This appendix lists public sources used to build PerfCodeBench. These sources provide realistic systems workloads. They also provide executable benchmark designs and optimization motifs for task construction. The source pool ...