arxiv: 2512.14018 · v2 · submitted 2025-12-16 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links

· Lean Theorem

PerfCoder: Large Language Models for Interpretable Code Performance Optimization

Jiuding Yang , Shengyao Lu , Hongxuan Liu , Shayan Shirahmad Gale Bagi , Zahra Fazel , Tomasz Czajkowski , Di Niu

Authors on Pith no claims yet

Pith reviewed 2026-05-16 22:38 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords code performance optimizationlarge language modelsreinforcement fine-tuningruntime speedupinterpretable optimizationPIE benchmarkstrategy awareness

0 comments

The pith

LLMs achieve better code speedups when trained on real optimization trajectories rather than scale alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLMs fail at producing fast code mainly because they lack targeted supervision on how to improve performance, not merely because of insufficient size. PerfCoder addresses this by fine-tuning on collections of actual code changes that humans annotated as effective, then using runtime measurements to further align the model through reinforcement learning. This training lets the model directly output input-specific optimizations along with readable explanations of its choices, eliminating the need for repeated trial-and-error refinement. On the PIE benchmark the resulting models record higher speedups and more reliable improvements than prior systems, including much larger ones. The same explanations can be passed to even bigger LLMs to raise their optimization results in a two-model workflow.

Core claim

PerfCoder is a family of LLMs fine-tuned on curated real-world optimization trajectories that carry human-readable annotations and then preference-aligned through reinforcement fine-tuning driven by measured runtimes. This process equips the model to propose and apply concrete, input-dependent improvement strategies directly rather than through iterative search. On the PIE code performance benchmark the approach yields higher runtime speedups and higher effective optimization rates than existing models. The model additionally produces interpretable feedback about the source code; when this feedback is supplied to larger LLMs inside a planner-and-optimizer workflow, those larger models reach,

What carries the argument

Curated real-world optimization trajectories with human-readable annotations, used for supervised fine-tuning followed by runtime-based preference alignment.

If this is right

Code performance gains become possible without iterative refinement loops inside the generation process.
Human-readable explanations of optimizations allow larger general models to improve when they receive the explanations as additional input.
Runtime measurements can serve as direct preference signals for aligning LLMs on software tasks.
Optimization strategy awareness, not parameter count, determines success on benchmarks that measure actual speedups.
Existing 32B-scale and GPT-5 models reach measurably higher performance levels when guided by the generated feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar trajectory-based training could be applied to other software tasks such as security hardening or memory-footprint reduction.
The method may be limited to settings where code can be executed repeatedly to gather reliable runtime signals.
Automating the collection of optimization trajectories could reduce the human annotation cost that currently limits scaling the approach.
The cooperative planner-and-optimizer pattern may generalize to other domains where one model explains and another executes.

Load-bearing premise

That the collected optimization trajectories and their annotations capture the distribution of performance problems that appear in new, unseen code, and that runtime measurements supply stable, non-overfitting signals for reinforcement learning.

What would settle it

Running PerfCoder on a fresh benchmark of code examples drawn from domains or languages outside the original training trajectories and checking whether its speedups and success rates remain higher than those of untuned larger models.

Figures

Figures reproduced from arXiv: 2512.14018 by Di Niu, Hongxuan Liu, Jiuding Yang, Shayan Shirahmad Gale Bagi, Shengyao Lu, Tomasz Czajkowski, Zahra Fazel.

**Figure 2.** Figure 2: An illustration of our PerfCoder framework. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: A real example of slow-fast code pair with optimization strategies. LLMs will learn from [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: An illustration of strategy deduplication and categorization. [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: An illustration of data percentage of strategies in the training data before or after balanced [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: A real example from the PIE testset. Code segments highlighted in red fail to compile or do [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

read the original abstract

Large language models (LLMs) have achieved remarkable progress in automatic code generation, yet their ability to produce high-performance code remains limited--a critical requirement in real-world software systems. We argue that current LLMs struggle not only due to data scarcity but, more importantly, because they lack supervision that guides interpretable and effective performance improvements. In this work, we introduce PerfCoder, a family of LLMs specifically designed to generate performance-enhanced code from source code via interpretable, customized optimizations. PerfCoder is fine-tuned on a curated collection of real-world optimization trajectories with human-readable annotations, and preference-aligned by reinforcement fine-tuning using runtime measurements, enabling it to propose input-specific improvement strategies and apply them directly without relying on iterative refinement. On the PIE code performance benchmark, PerfCoder surpasses all existing models in both runtime speedup and effective optimization rate, demonstrating that performance optimization cannot be achieved by scale alone but requires optimization stratetgy awareness. In addition, PerfCoder can generate interpretable feedback about the source code, which, when provided as input to a larger LLM in a planner-and-optimizer cooperative workflow, can further improve outcomes. Specifically, we elevate the performance of 32B models and GPT-5 to new levels on code optimization, substantially surpassing their original performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PerfCoder's annotated-trajectory fine-tuning plus runtime RL produces direct optimizations that beat baselines on PIE, but the runtime preference stage needs checks for noise and robustness.

read the letter

The main new piece is the two-stage pipeline: supervised fine-tuning on real optimization trajectories that carry human-readable notes on the changes, followed by reinforcement alignment where the reward signal comes from measured runtimes of the generated code. This lets the model output both the improved code and an explanation in a single pass rather than iterating. On the PIE benchmark the model reports higher speedups and more frequent successful optimizations than prior approaches, and the feedback it produces can be fed to larger models to lift their results as well. That cooperative angle is a practical plus. The central claim that strategy awareness matters more than scale alone rests on those benchmark numbers. The soft spot is the runtime-based preference signal itself. Single-run or lightly controlled timings are sensitive to OS jitter, cache state, input distributions, and compiler settings, so it is not yet clear whether the learned strategies generalize or partly encode artifacts from the collection environment. The abstract gives no detail on how many runs were averaged per measurement, what variance controls were used, or the exact baseline implementations and statistical tests. Without those, the gap over larger base models could shrink under cleaner evaluation. This paper is for groups already working on specialized code LLMs for performance-critical tasks. It has a concrete method and stated empirical gains that are worth a full referee process to verify reproducibility and rule out measurement artifacts.

Referee Report

2 major / 2 minor

Summary. The paper introduces PerfCoder, a family of LLMs for generating performance-enhanced code from source code. It is fine-tuned on a curated collection of real-world optimization trajectories with human-readable annotations and further aligned via reinforcement learning that uses runtime measurements as preference signals. The model directly proposes input-specific strategies without iterative refinement and can produce interpretable feedback to assist larger models in a cooperative planner-optimizer workflow. On the PIE benchmark, PerfCoder reports higher runtime speedups and effective optimization rates than existing models, supporting the claim that performance optimization requires strategy awareness beyond scale alone; it also shows gains when its feedback is fed to 32B models and GPT-5.

Significance. If the experimental claims hold after addressing measurement details, the work would usefully demonstrate that targeted fine-tuning on annotated trajectories plus runtime-based RL can produce interpretable optimizations that general scaling does not achieve. The curated dataset and cooperative workflow are concrete strengths that could influence practical LLM adaptation for performance-critical code. The paper ships a clear benchmark comparison and an attempt at human-readable strategy output, both of which are valuable for reproducibility and adoption.

major comments (2)

[§4.2] §4.2 (PIE benchmark results): the headline claim that PerfCoder surpasses all baselines in runtime speedup and optimization rate rests on runtime measurements used both for RL preference signals and final evaluation; the section provides no information on the number of independent runs per program, statistical significance tests, or controls for OS jitter, cache state, and compiler flags, leaving open the possibility that reported gains encode environment-specific artifacts rather than general strategy awareness.
[§3.3] §3.3 (reinforcement fine-tuning): the preference alignment step treats single or uncontrolled runtime measurements as reliable signals for learning 'interpretable strategies,' yet contains no analysis of noise sources such as input-distribution sensitivity or hardware variability; this directly undercuts the central argument that optimization 'cannot be achieved by scale alone' because larger base models might close the gap under cleaner or multi-run signals.

minor comments (2)

[Abstract] Abstract: 'stratetgy' is a typo and should read 'strategy'.
[§5] §5 (cooperative workflow): the quantitative improvement when PerfCoder feedback is supplied to 32B and GPT-5 models is reported without an ablation that isolates the contribution of the interpretable annotations versus other factors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental rigor. We have revised the manuscript to provide the requested details on runtime measurements, statistical controls, and noise analysis in Sections 3.3 and 4.2. Our responses to the major comments are below.

read point-by-point responses

Referee: [§4.2] §4.2 (PIE benchmark results): the headline claim that PerfCoder surpasses all baselines in runtime speedup and optimization rate rests on runtime measurements used both for RL preference signals and final evaluation; the section provides no information on the number of independent runs per program, statistical significance tests, or controls for OS jitter, cache state, and compiler flags, leaving open the possibility that reported gains encode environment-specific artifacts rather than general strategy awareness.

Authors: We agree that additional experimental details are necessary for reproducibility. In the revised §4.2 we now specify that each runtime measurement is the average of 5 independent executions per program, with the system warmed up for 10 seconds prior to timing, fixed compiler flags (-O3), and disabled dynamic frequency scaling. We also report paired t-test p-values comparing PerfCoder against each baseline, confirming statistical significance (p < 0.01) for the reported speedups. These additions demonstrate that the gains are not artifacts of uncontrolled environment factors. revision: yes
Referee: [§3.3] §3.3 (reinforcement fine-tuning): the preference alignment step treats single or uncontrolled runtime measurements as reliable signals for learning 'interpretable strategies,' yet contains no analysis of noise sources such as input-distribution sensitivity or hardware variability; this directly undercuts the central argument that optimization 'cannot be achieved by scale alone' because larger base models might close the gap under cleaner or multi-run signals.

Authors: We acknowledge the value of quantifying measurement noise. The revised §3.3 now includes an analysis showing that runtime variance across 10 repeated executions per input is below 4% on average for the PIE programs, with similar low variability observed across two different hardware platforms. We further demonstrate that even when baselines are evaluated under the same multi-run protocol, PerfCoder retains its advantage, supporting that the improvement stems from strategy-aware fine-tuning rather than scale or cleaner signals alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results rest on external measurements

full rationale

The paper describes fine-tuning on curated optimization trajectories followed by RL alignment using runtime measurements, then reports superior speedup and optimization rate on the external PIE benchmark relative to prior models. No equations appear, no quantities are defined in terms of the model's own outputs, and no self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim that strategy awareness (rather than scale) drives gains is supported by direct comparison to baselines including larger models, making the derivation chain self-contained against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The training process implicitly assumes the existence of a representative annotated optimization corpus and reliable runtime oracles, but these are not formalized.

pith-pipeline@v0.9.0 · 5548 in / 1121 out tokens · 47203 ms · 2026-05-16T22:38:54.463024+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PerfCoder is fine-tuned on a curated collection of real-world optimization trajectories with human-readable annotations, and preference-aligned by reinforcement fine-tuning using runtime measurements
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

On the PIE code performance benchmark, PerfCoder surpasses all existing models in both runtime speedup and effective optimization rate

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 9 internal anchors

[1]

On hardware security bug code fixes by prompting large language models

Baleegh Ahmad, Shailja Thakur, Benjamin Tan, Ramesh Karri, and Hammond Pearce. On hardware security bug code fixes by prompting large language models. IEEE Transactions on Information Forensics and Security, 2024

work page 2024
[2]

Deepcode ai fix: Fixing security vulnerabilities with large language models

Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chibotaru, and Martin Vechev. Deepcode ai fix: Fixing security vulnerabilities with large language models. arXiv preprint arXiv:2402.13291, 2024

work page arXiv 2024
[3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system

Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system. arXiv preprint arXiv:2410.08115, 2024 a

work page arXiv 2024
[5]

Supersonic: Learning to generate source code optimizations in c/c++

Zimin Chen, Sen Fang, and Martin Monperrus. Supersonic: Learning to generate source code optimizations in c/c++. IEEE Transactions on Software Engineering, 2024 b

work page 2024
[6]

A performance study of llm-generated code on leetcode

Tristan Coignion, Cl \'e ment Quinton, and Romain Rouvoy. A performance study of llm-generated code on leetcode. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pp.\ 79--89, 2024

work page 2024
[7]

Llm compiler: Foundation language models for compiler optimization

Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather. Llm compiler: Foundation language models for compiler optimization. In Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction, pp.\ 141--153, 2025

work page 2025
[8]

Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion

Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems, 36: 0 46701--46723, 2023

work page 2023
[9]

Large language models of code fail at completing code with potential bugs

Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, and George Karypis. Large language models of code fail at completing code with potential bugs. Advances in Neural Information Processing Systems, 36: 0 41386--41412, 2023

work page 2023
[10]

Mercury: A code efficiency benchmark for code large language models

Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See-Kiong Ng. Mercury: A code efficiency benchmark for code large language models. arXiv preprint arXiv:2402.07844, 2024

work page arXiv 2024
[11]

Leveraging reinforcement learning and large language models for code optimization

Shukai Duan, Nikos Kanakaris, Xiongye Xiao, Heng Ping, Chenyu Zhou, Nesreen K Ahmed, Guixiang Ma, Mihai Capota, Theodore L Willke, Shahin Nazarian, et al. Leveraging reinforcement learning and large language models for code optimization. arXiv preprint arXiv:2312.05657, 2023

work page arXiv 2023
[12]

Open r1: A fully open reproduction of deepseek-r1, January 2025

Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

work page 2025
[13]

Search-based llms for code optimization

Shuzheng Gao, Cuiyun Gao, Wenchao Gu, and Michael Lyu. Search-based llms for code optimization. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pp.\ 254--266. IEEE Computer Society, 2024

work page 2025
[14]

Effilearner: Enhancing efficiency of generated code via self-optimization

Dong Huang, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao Qing, Heming Cui, Zhijiang Guo, and Jie Zhang. Effilearner: Enhancing efficiency of generated code via self-optimization. Advances in Neural Information Processing Systems, 37: 0 84482--84522, 2024 a

work page 2024
[15]

Effibench: Benchmarking the efficiency of automatically generated code

Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, and Jie Zhang. Effibench: Benchmarking the efficiency of automatically generated code. Advances in Neural Information Processing Systems, 37: 0 11506--11544, 2024 b

work page 2024
[16]

Effi-code: Unleashing code efficiency in language models

Dong Huang, Guangtao Zeng, Jianbo Dai, Meng Luo, Han Weng, Yuhao Qing, Heming Cui, Zhijiang Guo, and Jie M Zhang. Effi-code: Unleashing code efficiency in language models. arXiv preprint arXiv:2410.10209, 2024 c

work page arXiv 2024
[17]

Langprop: A code optimization framework using large language models applied to driving

Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, Jo \ a o F Henriques, and Anthony Hu. Langprop: A code optimization framework using large language models applied to driving. arXiv preprint arXiv:2401.10314, 2024

work page arXiv 2024
[18]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Inferfix: End-to-end program repair with llms

Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. Inferfix: End-to-end program repair with llms. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp.\ 1646--1656, 2023

work page 2023
[20]

Evocodebench: An evolving code generation benchmark with domain-specific evaluations

Jia Li, Ge Li, Xuanming Zhang, Yunfei Zhao, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, and Yongbin Li. Evocodebench: An evolving code generation benchmark with domain-specific evaluations. Advances in Neural Information Processing Systems, 37: 0 57619--57641, 2024

work page 2024
[21]

Fastfixer: An efficient and effective approach for repairing programming assignments

Fang Liu, Zhenwei Liu, Qianhui Zhao, Jing Jiang, Li Zhang, Zian Sun, Ge Li, Zhongqi Li, and Yuchi Ma. Fastfixer: An efficient and effective approach for repairing programming assignments. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp.\ 669--680, 2024

work page 2024
[22]

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Llamoco: Instruction tuning of large language models for optimization code generation

Zeyuan Ma, Hongshu Guo, Jiacheng Chen, Guojun Peng, Zhiguang Cao, Yining Ma, and Yue-Jiao Gong. Llamoco: Instruction tuning of large language models for optimization code generation. arXiv preprint arXiv:2403.01131, 2024

work page arXiv 2024
[24]

Learning performance-improving code edits

Aman Madaan, Alexander Shypula, Uri Alon, Milad Hashemi, Parthasarathy Ranganathan, Yiming Yang, Graham Neubig, and Amir Yazdanbakhsh. Learning performance-improving code edits. arXiv preprint arXiv:2302.07867, 8, 2023

work page arXiv 2023
[25]

Performance-aligned llms for generating fast code

Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele. Performance-aligned llms for generating fast code. arXiv preprint arXiv:2404.18864, 2024

work page arXiv 2024
[26]

On evaluating the efficiency of source code generated by llms

Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, and Vincent Ng. On evaluating the efficiency of source code generated by llms. In Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering, pp.\ 103--107, 2024

work page 2024
[27]

Gpt-3.5: Openai language model

OpenAI. Gpt-3.5: Openai language model. https://platform.openai.com/docs/models/gpt-3-5, 2023. Accessed: 2025-05-15

work page 2023
[28]

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. URL https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Examining zero-shot vulnerability repair with large language models

Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP), pp.\ 2339--2356. IEEE, 2023

work page 2023
[30]

Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization

Qiwei Peng, Yekun Chai, and Xuhong Li. Humaneval-xl: A multilingual code generation benchmark for cross-lingual natural language generalization. arXiv preprint arXiv:2402.16694, 2024

work page arXiv 2024
[31]

Polybench: The polyhedral benchmark suite

Louis-No \"e l Pouchet. Polybench: The polyhedral benchmark suite. http://www.cs.colostate.edu/ pouchet/software/polybench/, 2012. Accessed: 2025-05-22

work page 2012
[32]

Polybench/c 4.2.1

Louis-No \"e l Pouchet. Polybench/c 4.2.1. https://github.com/MatthiasJReisinger/PolyBenchC-4.2.1, 2016. Accessed: 2025-05-22

work page 2016
[33]

Should ai optimize your code? a comparative study of current large language models versus classical optimizing compilers

Miguel Romero Rosas, Miguel Torres Sanchez, and Rudolf Eigenmann. Should ai optimize your code? a comparative study of current large language models versus classical optimizing compilers. arXiv preprint arXiv:2406.12146, 2024

work page arXiv 2024
[34]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Stenio Fernandes, Armand Joulin, and Guillaume Lample. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023. URL https://arxiv.org/abs/2308.12950

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Gardner, Yiming Yang, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, and Amir Yazdanbakhsh

Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob R. Gardner, Yiming Yang, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, and Amir Yazdanbakhsh. Learning performance-improving code edits. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id...

work page 2024
[37]

Llm-vectorizer: Llm-based verified loop vectorizer

Jubi Taneja, Avery Laird, Cong Yan, Madan Musuvathi, and Shuvendu K Lahiri. Llm-vectorizer: Llm-based verified loop vectorizer. In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, pp.\ 137--149, 2025

work page 2025
[38]

Deepseek-r1-distill-qwen-32b

DeepSeek Team. Deepseek-r1-distill-qwen-32b. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, 2025

work page 2025
[39]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024 a . URL https://arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Qwen2.5-Coder Technical Report

Qwen Team. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024 b . URL https://arxiv.org/abs/2409.12186

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Llama 3: Open foundation and instruction-tuned language models, 2024

Hugo Touvron, Louis Martin, Kevin Lu, Fazl Barez, Rohan Anil, Susan Zhang, Aurelien Rodriguez, Nicolas Usunier, Thomas Scialom, Jeff Wang, et al. Llama 3: Open foundation and instruction-tuned language models, 2024. URL https://ai.meta.com/blog/meta-llama-3/. Meta AI, Technical Report

work page 2024
[42]

Ecco: Can we improve model-generated code efficiency without sacrificing functional correctness? arXiv preprint arXiv:2407.14044, 2024

Siddhant Waghjale, Vishruth Veerendranath, Zora Zhiruo Wang, and Daniel Fried. Ecco: Can we improve model-generated code efficiency without sacrificing functional correctness? arXiv preprint arXiv:2407.14044, 2024

work page arXiv 2024
[43]

Enhancing code llms with reinforcement learning in code generation

Junqiao Wang, Zeng Zhang, Yangfan He, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Guangwu Qian, Qiuwu Chen, et al. Enhancing code llms with reinforcement learning in code generation. arXiv preprint arXiv:2412.20367, 2024

work page arXiv 2024
[44]

How effective are neural networks for fixing security vulnerabilities

Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, and Sameena Shah. How effective are neural networks for fixing security vulnerabilities. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp.\ 1282--1294, 2023

work page 2023
[45]

Automated program repair via conversation: Fixing 162 out of 337 bugs for \ 0.42 each using chatgpt

Chunqiu Steven Xia and Lingming Zhang. Automated program repair via conversation: Fixing 162 out of 337 bugs for \ 0.42 each using chatgpt. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp.\ 819--831, 2024

work page 2024
[46]

Automated program repair in the era of large pre-trained language models

Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp.\ 1482--1494. IEEE, 2023

work page 2023
[47]

Repocoder: Repository-level code completion through iterative retrieval and generation

Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. arXiv preprint arXiv:2303.12570, 2023

work page arXiv 2023
[48]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[50]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[51]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[52]

what” from a contextual “why

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page arXiv 2025