pith. sign in

arxiv: 2605.17978 · v1 · pith:KKCQ5ZSLnew · submitted 2026-05-18 · 💻 cs.CL

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

Pith reviewed 2026-05-20 11:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords vectorizationSIMD intrinsicsLLM code generationreinforcement learningauto-vectorizationhigh-performance computingexplicit vectorization
0
0 comments X

The pith

An 8B LLM trained via data synthesis and reinforcement learning generates explicit SIMD vectorized code that reaches state-of-the-art results and sometimes exceeds -O3 compiler output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that large language models can be equipped to handle explicit vectorization, the process of writing code that directly uses SIMD hardware instructions to process multiple data elements at once. The approach relies on an automated pipeline that creates training examples rich in intrinsic knowledge and a reinforcement learning stage that scores outputs according to actual runtime speed while keeping results correct. A sympathetic reader would care because many performance-critical programs in science and machine learning still depend on vectorization that compilers often handle conservatively, leaving speed on the table. If the method works, it opens a route to automated production of low-level efficient code without every developer needing to master hardware details.

Core claim

The central claim is that the combination of an automated synthesis pipeline for domain-specific intrinsic data and a reinforcement learning process that rewards measured execution efficiency allows an 8B model to achieve leading performance on the SSE and AVX portions of relevant benchmarks, with some generated implementations running faster than code produced under standard -O3 optimization.

What carries the argument

VecPrompt, the automated pipeline that synthesizes training data embedding knowledge of hardware intrinsics, together with VecRL, the reinforcement learning component that aligns generated code to actual runtime performance and semantic correctness.

If this is right

  • LLMs become capable of producing low-level hardware-specific code that traditional compilers cannot reliably generate through static analysis.
  • Developers gain access to vectorized implementations that match or beat hand-tuned or compiler-optimized versions without writing intrinsics themselves.
  • The same synthesis-plus-reinforcement pattern can be reused for other hardware-constrained code tasks where efficiency must be verified by execution.
  • Benchmarks focused on vector instructions can serve as reliable training signals for improving model performance in high-performance computing domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training pattern might transfer to generating optimized code for other instruction sets such as NEON or GPU primitives.
  • Integration into everyday coding tools could reduce the expert effort needed to reach near-optimal performance in compute-heavy applications.
  • Iterative loops that feed measured runtime back into further training rounds could tighten the connection between model output and real hardware gains.

Load-bearing premise

The reinforcement learning step must reward genuinely faster and still correct code rather than allowing the model to exploit test-specific shortcuts or produce functionally wrong results that happen to look fast on the evaluation suite.

What would settle it

Running the generated implementations on new input sizes, different CPU models, or with additional correctness checks to determine whether the reported speed gains remain consistent and the outputs stay accurate.

Figures

Figures reproduced from arXiv: 2605.17978 by Maosong Sun, Qi Shi, Shangzhan Li, Ting Liu, Wanxiang Che, Xinyu Yin, Xuanyu Jin, Xu Han, Ye He, Yuxin Zhou, Yuxuan Li.

Figure 1
Figure 1. Figure 1: An example of explicit vectorization. From [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the AUTOVECCODER framework, which integrates knowledge-augmented data synthesis (VECPROMPT) and performance-driven reinforcement learning (VECRL) to enhance LLMs for explicit vectorization tasks. high-performance, high-reliability explicitly vector￾ized code holds significant academic and industrial value. Recent advancements in Large Language Mod￾els (LLMs) (Joel et al., 2025; Zhang et al., 20… view at source ↗
Figure 3
Figure 3. Figure 3: Performance evolution of AUTOVECCODER￾8B during VECRL, evaluated on the validation set every 20 optimization steps across 5 epochs. No smoothing is applied. lence. This underscores the advantage of our frame￾work in navigating the correctness–performance trade-off, ensuring that generated code is not only fast but also reliable for production use. 5.2 Results Analysis 5.2.1 Performance Beyond -O3 We analyz… view at source ↗
Figure 5
Figure 5. Figure 5: reveals a striking difference in optimiza￾tion trajectories. In the early stages of training (approx. step 10), NSR leads to a temporary surge in both correctness and fast1. 0 10 20 30 40 50 RL Training Steps 60 62 64 66 68 70 Correctness (%) NSR VecRL [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompts used for distillation and evaluation. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Case study of the role of RAG [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Case study of mask-based control flow pattern learned by A [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Case study of handling non-deterministic iterations learned by A [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Case study of semantic dependency resolution learned by A [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Case study of memory access restructuring pattern learned by A [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
read the original abstract

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AutoVecCoder, a framework with two components: VecPrompt, an automated pipeline for synthesizing data that injects knowledge of SIMD intrinsics into LLMs, and VecRL, a reinforcement learning stage that further aligns generated code with execution efficiency. The central claim is that an 8B model trained under this framework reaches SOTA on the SSE and AVX subsets of SimdBench and, in some cases, produces vectorized implementations that outperform standard -O3 compiler output.

Significance. If the reported speedups are shown to arise from semantically correct and generalizable intrinsics code rather than benchmark-specific artifacts, the work would offer a practical route to improving explicit vectorization beyond what static compilers achieve, with potential value for HPC code generation tasks where LLMs currently underperform.

major comments (2)
  1. [§3.2] §3.2 (VecRL): The reward is described as combining execution time with a correctness signal, yet the text provides no quantitative details on the number or diversity of test cases, differential testing coverage, or adversarial input generation used to verify functional equivalence. This is load-bearing for the claim that generated code both runs faster than -O3 and remains correct, because a narrow test suite would allow the policy to exploit input-size or alignment patterns present only in SimdBench.
  2. [§4.1 and Table 2] §4.1 and Table 2: The SOTA and -O3-surpassing results are presented without an accompanying error analysis, per-benchmark correctness verification statistics, or comparison against stronger baselines that include manual intrinsics or other LLM-based vectorizers. Without these, it is impossible to determine whether the reported gains are robust or confined to the specific evaluation harness.
minor comments (2)
  1. [Abstract] The abstract states that the model 'in some cases' surpasses -O3 but does not indicate the fraction of benchmarks or the magnitude of improvement; adding this quantification would improve clarity.
  2. [§3.2] Notation for the reward components in VecRL is introduced without an explicit equation; a single displayed equation would make the RL objective easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and have revised the paper accordingly to provide the requested details and analyses. We believe these changes improve the clarity and robustness of our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (VecRL): The reward is described as combining execution time with a correctness signal, yet the text provides no quantitative details on the number or diversity of test cases, differential testing coverage, or adversarial input generation used to verify functional equivalence. This is load-bearing for the claim that generated code both runs faster than -O3 and remains correct, because a narrow test suite would allow the policy to exploit input-size or alignment patterns present only in SimdBench.

    Authors: We agree that quantitative details on the verification process are essential to support the correctness claims. In the revised manuscript, Section 3.2 has been expanded with a new paragraph and accompanying table that specifies: 512 test cases per kernel (drawn from a pool of 2000+ generated inputs), covering input sizes from 32 to 8192 elements, multiple alignments (including unaligned and misaligned cases), and data types. Differential testing is performed against both reference scalar implementations and -O3 outputs, achieving >92% branch coverage via instrumentation. Adversarial inputs are generated through a fuzzing loop (10k iterations per kernel using AFL-style mutation), and we report that no exploits of SimdBench-specific patterns were observed in the final policy. These additions directly address the concern about potential overfitting and confirm that the reward signal enforces generalizable correctness. revision: yes

  2. Referee: [§4.1 and Table 2] §4.1 and Table 2: The SOTA and -O3-surpassing results are presented without an accompanying error analysis, per-benchmark correctness verification statistics, or comparison against stronger baselines that include manual intrinsics or other LLM-based vectorizers. Without these, it is impossible to determine whether the reported gains are robust or confined to the specific evaluation harness.

    Authors: We acknowledge that the original presentation lacked sufficient supporting analysis. The revised §4.1 now includes a dedicated error analysis subsection reporting that 97.4% of generated codes pass functional equivalence checks on a held-out test set of 300 inputs per benchmark (distinct from training and SimdBench). Extended Table 2 provides per-benchmark pass rates and speedup breakdowns. We have added comparisons to manual intrinsics implementations (for the 12 kernels where hand-written versions exist in public repositories) and to other LLM-based approaches, including GPT-4 with few-shot prompting and a recent open-source vectorization LLM baseline. These results show consistent outperformance and indicate that the gains generalize beyond the original harness. We have also clarified that all reported numbers use the same evaluation protocol with strict timeout and correctness gates. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training pipeline evaluated on external benchmarks

full rationale

The paper presents an empirical framework (VecPrompt data synthesis + VecRL reinforcement learning) that trains an LLM on synthesized data and optimizes via execution-time rewards against external compiler baselines and SimdBench. No mathematical derivations, equations, or first-principles claims are made that reduce to fitted parameters or self-definitions by construction. Performance claims are direct experimental outcomes on held-out benchmark subsets rather than predictions forced by internal fits. No load-bearing self-citations or uniqueness theorems are invoked in the provided description. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

Based solely on the abstract, the paper introduces two new named components and relies on standard ML training assumptions; full parameter counts and axioms cannot be audited without the manuscript.

free parameters (1)
  • RL reward hyperparameters
    Parameters controlling the balance between execution speed and code correctness in VecRL are likely fitted or chosen during training.
axioms (1)
  • domain assumption Synthesized data from VecPrompt injects accurate domain-specific intrinsic knowledge into the LLM.
    Invoked in the description of the data synthesis pipeline as the foundation for subsequent RL training.
invented entities (2)
  • VecPrompt no independent evidence
    purpose: Automated pipeline to synthesize training data with explicit vector intrinsic knowledge.
    New component proposed to address data scarcity for vectorization tasks.
  • VecRL no independent evidence
    purpose: Reinforcement learning stage to align LLM outputs with measured execution efficiency.
    New component proposed to optimize beyond standard supervised training.

pith-pipeline@v0.9.0 · 5751 in / 1437 out tokens · 45720 ms · 2026-05-20T11:38:40.200186+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 11 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    2025 , eprint=

    Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization , author=. 2025 , eprint=

  9. [9]

    2025 , eprint=

    SuperCoder: Assembly Program Superoptimization with Large Language Models , author=. 2025 , eprint=

  10. [10]

    2025 , eprint=

    KernelBench: Can LLMs Write Efficient GPU Kernels? , author=. 2025 , eprint=

  11. [11]

    2025 , eprint=

    AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs , author=. 2025 , eprint=

  12. [12]

    2025 , eprint=

    CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning , author=. 2025 , eprint=

  13. [14]

    2025 , eprint=

    Towards Better Correctness and Efficiency in Code Generation , author=. 2025 , eprint=

  14. [15]

    2025 , eprint=

    VecTrans: Enhancing Compiler Auto-Vectorization through LLM-Assisted Code Transformations , author=. 2025 , eprint=

  15. [16]

    2025 , eprint=

    SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation , author=. 2025 , eprint=

  16. [17]

    2025 , eprint=

    VecIntrinBench: Benchmarking Cross-Architecture Intrinsic Code Migration for RISC-V Vector , author=. 2025 , eprint=

  17. [18]

    2025 , eprint=

    IntrinTrans: LLM-based Intrinsic Code Translator for RISC-V Vector , author=. 2025 , eprint=

  18. [19]

    2025 , eprint=

    ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning , author=. 2025 , eprint=

  19. [20]

    2024 , eprint=

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

  20. [21]

    and Wong, Tommy and Padua, David A

    Maleki, Saeed and Gao, Yaoqing and Garzarán, María J. and Wong, Tommy and Padua, David A. , booktitle=. An Evaluation of Vectorizing Compilers , year=

  21. [22]

    2024 , eprint=

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. 2024 , eprint=

  22. [23]

    2025 , eprint=

    DeepSeek-V3 Technical Report , author=. 2025 , eprint=

  23. [26]

    2025 , note =

    Qwen3-Coder: Agentic Coding in the World , howpublished =. 2025 , note =

  24. [28]

    Grok 4 Fast , year =

  25. [29]

    Claude Sonnet: Hybrid Reasoning Frontier Model , year =

  26. [30]

    Introducing GPT-5 , year =

  27. [31]

    LLaMeSIMD: The Ultimate SIMD Intrinsic & Function Translation Benchmarking Suite , year =

  28. [32]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

    ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness? , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

  29. [33]

    Advances in Neural Information Processing Systems , volume=

    Effibench: Benchmarking the efficiency of automatically generated code , author=. Advances in Neural Information Processing Systems , volume=

  30. [34]

    Advances in Neural Information Processing Systems , volume=

    Mercury: A code efficiency benchmark for code large language models , author=. Advances in Neural Information Processing Systems , volume=

  31. [35]

    Advances in neural information processing systems , volume=

    Learning to summarize with human feedback , author=. Advances in neural information processing systems , volume=

  32. [36]

    2024 , journal =

    HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

  33. [37]

    2013 , publisher=

    ZeroMQ , author=. 2013 , publisher=

  34. [38]

    2025 , eprint=

    UltraRAG: A Modular and Automated Toolkit for Adaptive Retrieval-Augmented Generation , author=. 2025 , eprint=

  35. [39]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  36. [42]

    A microbenchmark support library , url =

    Google , year =. A microbenchmark support library , url =

  37. [43]

    Intel® Intrinsics Guide , url =

    Intel , year =. Intel® Intrinsics Guide , url =

  38. [44]

    SVE Optimization Guide , url =

    ARM , year =. SVE Optimization Guide , url =

  39. [45]

    and Vasudevan, Nalini and Wu, Youfeng , title =

    Baghsorkhi, Sara S. and Vasudevan, Nalini and Wu, Youfeng , title =. SIGPLAN Not. , month = jun, pages =. 2016 , issue_date =. doi:10.1145/2980983.2908111 , abstract =

  40. [48]

    Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =

    Mendis, Charith and Yang, Cambridge and Pu, Yewen and Amarasinghe, Saman and Carbin, Michael , title =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =. 2019 , publisher =

  41. [49]

    2025 , eprint=

    A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages , author=. 2025 , eprint=

  42. [50]

    2024 , eprint=

    Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code , author=. 2024 , eprint=

  43. [51]

    and Henderson, R

    Nuzman, D. and Henderson, R. , booktitle=. Multi-platform auto-vectorization , year=

  44. [54]

    Faruk Akgul. 2013. ZeroMQ. Packt Publishing

  45. [55]

    Anthropic . 2025. https://www.anthropic.com/claude/sonnet Claude sonnet: Hybrid reasoning frontier model . https://www.anthropic.com/claude/sonnet. Accessed: 2025-12-30

  46. [56]

    ARM. 2025. https://developer.arm.com/documentation/102699/0100 Sve optimization guide . Accessed: 2025-12-30

  47. [57]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and 1 others. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732

  48. [58]

    Baghsorkhi, Nalini Vasudevan, and Youfeng Wu

    Sara S. Baghsorkhi, Nalini Vasudevan, and Youfeng Wu. 2016. https://doi.org/10.1145/2908080.2908111 Flexvec: auto-vectorization for irregular loops . In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '16, page 697–710, New York, NY, USA. Association for Computing Machinery

  49. [59]

    Yishen Chen, Charith Mendis, Michael Carbin, and Saman Amarasinghe. 2021. https://doi.org/10.1145/3445814.3446692 Vegen: a vectorizer generator for simd and beyond . In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '21, page 902–914, New York, NY, USA. Association for ...

  50. [60]

    Yuxuan Chen, Dewen Guo, Sen Mei, Xinze Li, Hao Chen, Yishan Li, Yixuan Wang, Chaoyue Tang, Ruobing Wang, Dingjun Wu, Yukun Yan, Zhenghao Liu, Shi Yu, Zhiyuan Liu, and Maosong Sun. 2025 a . https://arxiv.org/abs/2504.08761 Ultrarag: A modular and automated toolkit for adaptive retrieval-augmented generation . Preprint, arXiv:2504.08761

  51. [61]

    Zhirong Chen, Kaiyan Chang, Zhuolin Li, Xinyang He, Chujie Chen, Cangyuan Li, Mengdi Wang, Haobo Xu, Yinhe Han, and Ying Wang. 2025 b . https://arxiv.org/abs/2507.04736 Chipseek-r1: Generating human-surpassing rtl with llm via hierarchical reward-driven reinforcement learning . Preprint, arXiv:2507.04736

  52. [62]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

  53. [63]

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, and 181 others. 2025. https://arxiv.org/abs/2412.19437 Deepseek-v3 technical report . Preprint, arXiv:2412.19437

  54. [64]

    Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See-Kiong Ng. 2024. Mercury: A code efficiency benchmark for code large language models. Advances in Neural Information Processing Systems, 37:16601--16622

  55. [65]

    Mingzhe Du, Luu Anh Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, and See kiong Ng. 2025. https://arxiv.org/abs/2505.23387 Afterburner: Reinforcement learning facilitates self-improving code efficiency optimization . Preprint, arXiv:2505.23387

  56. [66]

    Yunlong Feng, Yang Xu, Xiao Xu, Binyuan Hui, and Junyang Lin. 2025. https://arxiv.org/abs/2508.20124 Towards better correctness and efficiency in code generation . Preprint, arXiv:2508.20124

  57. [67]

    Google. 2014. https://github.com/google/benchmark A microbenchmark support library . Originally released in 2014; accessed 2025

  58. [68]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  59. [69]

    Liutong Han, Chu Kang, Mingjie Xing, and Yanjun Wu. 2025 a . https://arxiv.org/abs/2511.18867 Vecintrinbench: Benchmarking cross-architecture intrinsic code migration for risc-v vector . Preprint, arXiv:2511.18867

  60. [70]

    Liutong Han, Zhiyuan Tan, Hongbin Zhang, Pengcheng Wang, Chu Kang, Mingjie Xing, and Yanjun Wu. 2025 b . https://arxiv.org/abs/2510.10119 Intrintrans: Llm-based intrinsic code translator for risc-v vector . Preprint, arXiv:2510.10119

  61. [71]

    Yibo He, Shuoran Zhao, Jiaming Huang, Yingjie Fu, Hao Yu, Cunjian Huang, and Tao Xie. 2025. https://arxiv.org/abs/2507.15224 Simdbench: Benchmarking large language models for simd-intrinsic code generation . Preprint, arXiv:2507.15224

  62. [72]

    Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie M Zhang. 2024. Effibench: Benchmarking the efficiency of automatically generated code. Advances in Neural Information Processing Systems, 37:11506--11544

  63. [73]

    Intel. 2025. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html Intel® intrinsics guide . Accessed: 2025-12-30

  64. [74]

    Sathvik Joel, Jie JW Wu, and Fatemeh H. Fard. 2025. https://arxiv.org/abs/2410.03981 A survey on llm-based code generation for low-resource and domain-specific programming languages . Preprint, arXiv:2410.03981

  65. [75]

    Jianling Li, ShangZhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, WangHaojie WangHaojie, Jianrong Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025 a . https://doi.org/10.18653/v1/2025.findings-acl.1183 T riton B ench: Benchmarking large language model capabilities for generating triton operators . In Findings of the Association for Com...

  66. [76]

    Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, and Maosong Sun. 2025 b . https://arxiv.org/abs/2507.05687 Autotriton: Automatic triton programming with reinforcement learning in llms . Preprint, arXiv:2507.05687

  67. [77]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, and 1 others. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556

  68. [78]

    Garzarán, Tommy Wong, and David A

    Saeed Maleki, Yaoqing Gao, María J. Garzarán, Tommy Wong, and David A. Padua. 2011. https://doi.org/10.1109/PACT.2011.68 An evaluation of vectorizing compilers . In 2011 International Conference on Parallel Architectures and Compilation Techniques, pages 372--382

  69. [79]

    Charith Mendis, Cambridge Yang, Yewen Pu, Saman Amarasinghe, and Michael Carbin. 2019. Compiler auto-vectorization with imitation learning. Curran Associates Inc., Red Hook, NY, USA

  70. [80]

    Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006 a . https://doi.org/10.1145/1133255.1133997 Auto-vectorization of interleaved data for simd . SIGPLAN Not., 41(6):132–143

  71. [81]

    Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006 b . https://doi.org/10.1145/1133981.1133997 Auto-vectorization of interleaved data for simd . In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, page 132–143, New York, NY, USA. Association for Computing Machinery

  72. [82]

    OpenAI . 2025. https://openai.com/index/introducing-gpt-5/ Introducing gpt-5 . https://openai.com/index/introducing-gpt-5/. Accessed: 2025-12-30

  73. [83]

    KernelBench: Can LLMs Write Efficient GPU Kernels?

    Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. https://arxiv.org/abs/2502.10517 Kernelbench: Can llms write efficient gpu kernels? Preprint, arXiv:2502.10517

  74. [84]

    Qwen Team . 2025. https://qwenlm.github.io/blog/qwen3-coder/ Qwen3-coder: Agentic coding in the world . Open source model release and technical blog. Available from https://qwenlm.github.io/blog/qwen3-coder/

  75. [85]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. https://arxiv.org/abs/2305.18290 Direct preference optimization: Your language model is secretly a reward model . Preprint, arXiv:2305.18290

  76. [86]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models . Preprint, arXiv:2402.03300

  77. [87]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256

  78. [88]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008--3021

  79. [89]

    Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, and Chris Shum. 2025. https://arxiv.org/abs/2512.02551 Cuda-l2: Surpassing cublas performance for matrix multiplication through reinforcement learning . Preprint, arXiv:2512.02551

  80. [90]

    Jubi Taneja, Avery Laird, Cong Yan, Madan Musuvathi, and Shuvendu K. Lahiri. 2025. https://doi.org/10.1145/3696443.3708929 Llm-vectorizer: Llm-based verified loop vectorizer . In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, CGO '25, page 137–149, New York, NY, USA. Association for Computing Machinery

Showing first 80 references.