
arxiv: 2606.28438 · v1 · pith:ACB6IZPAnew · submitted 2026-06-26 · 💻 cs.SE · cs.AI

When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs

Pith reviewed 2026-06-30 01:27 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords recursive self-training · code LLMs · AI self-review · model collapse · distributional reweighting · self-gating · code generation

The pith

AI self-review of code LLMs degenerates into rubber-stamp approval under recursive training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines recursive self-training loops in code LLMs, where AI-generated code can feed back into training data through repositories. It tests three regimes: no review, Human-gate review with model-independent checks such as compilation, and AI-self-gate review using the model's own signals such as perplexity or binary self-scoring. Across models and benchmarks, no review collapses fastest, and Human gates slow the decline but do not halt it; AI-self-gate filters initially mask problems but later accept poorer outputs. The authors model review as gated distributional reweighting and prove that self-gating reduces to ungated self-training once the model begins confirming its own outputs, producing rising acceptance scores alongside falling benchmark correctness. This matters because AI coding tools now generate code faster than humans can review it, risking uncontrolled quality degradation in real codebases.
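To make the "model's own signals" concrete, here is a minimal sketch of a perplexity-based self-gate. It is an illustration only, not the paper's implementation: the function names and the `max_ppl` threshold are hypothetical, and the token log-probabilities are assumed to come from the same model that generated the code, which is what couples this gate to the model.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log token probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_self_gate(token_logprobs: Sequence[float], max_ppl: float = 4.0) -> bool:
    """Accept a sample when the generating model finds it unsurprising.

    max_ppl is a hypothetical threshold chosen for illustration. Because the
    log-probabilities come from the model being trained, the gate moves with
    the model from round to round.
    """
    return perplexity(token_logprobs) <= max_ppl
```

A sample whose tokens each have probability 0.5 has perplexity 2 and passes this illustrative threshold; one whose tokens each have probability 0.1 has perplexity 10 and is rejected.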

Core claim

Formulating review as gated distributional reweighting shows that AI self-gating degenerates to ungated self-training under a self-confirming acceptance condition. In this regime the binary self-gate enters a rubber-stamp state where acceptance scores rise while benchmark correctness falls. A spectral analysis of representation-level covariance concentration under recursive retraining supports the observed collapse.

What carries the argument

Gated distributional reweighting, in which review functions as a gate that reweights the training distribution, together with the self-confirming acceptance condition that renders the gate permanently open.
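One way to write this down, as a sketch rather than the paper's exact statement: the symbols p_X, q_θt and r are borrowed from the Figure 1 caption, while the normalizer Z_t, the tilde notation, and the precise form of the self-confirming condition are assumptions made here for illustration.

```latex
% Assumed notation: p_X is the prompt distribution, q_{\theta_t} the model at
% round t, and r_t(x, c) \in [0, 1] the gate's acceptance probability.
\[
\tilde{m}_t(x, c) \;=\; \frac{p_X(x)\, q_{\theta_t}(c \mid x)\, r_t(x, c)}{Z_t},
\qquad
Z_t \;=\; \mathbb{E}_{x \sim p_X,\; c \sim q_{\theta_t}(\cdot \mid x)}\bigl[\, r_t(x, c) \,\bigr].
\]
% Self-confirming case (assumed form): the gate accepts almost everything
% the current model emits, so the reweighting does nothing.
\[
r_t(x, c) = 1 \ \text{almost surely under } p_X(x)\, q_{\theta_t}(c \mid x)
\;\Longrightarrow\;
Z_t = 1
\quad\text{and}\quad
\tilde{m}_t(x, c) = p_X(x)\, q_{\theta_t}(c \mid x).
\]
```

Under this reading, an exogenous gate keeps r fixed across rounds, whereas a self-gate's r_t depends on θ_t and can drift toward the always-accept case.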

If this is right

  • No review produces the fastest performance collapse.
  • Human-gate filters using compilation and static quality checks slow but do not stop the decline.
  • AI-self-gate filters appear effective early yet later lose filtering power and enter the rubber-stamp regime.
  • Stable recursive code LLM training requires exogenous verification sources instead of model-coupled self-review.
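As an illustration of what a model-independent gate can look like, the sketch below checks that a Python sample parses and passes one trivial static rule. It is an assumption-laden stand-in: the paper's actual Compile and Quality filters are not specified in this summary, and `max_line_length` is a hypothetical threshold.

```python
import ast


def compile_gate(source: str) -> bool:
    """Return True when the sample parses as Python; no model signal is used."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def passes_static_checks(source: str, max_line_length: int = 120) -> bool:
    """Hypothetical quality rule: non-empty source with no overlong lines."""
    if not source.strip():
        return False
    return all(len(line) <= max_line_length for line in source.splitlines())


def human_gate(source: str) -> bool:
    """Exogenous gate: its rules stay fixed no matter how the model changes."""
    return compile_gate(source) and passes_static_checks(source)
```

The point of the design is that none of these functions takes the model as an argument, so the acceptance rule cannot drift as the model is retrained.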

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If public repositories increasingly contain AI-generated code accepted by similar models, the self-training loop could accelerate outside controlled benchmarks.
  • Replacing benchmark evaluation with tests on freshly written, non-contaminated code snippets could isolate whether the observed drop is real degradation or measurement artifact.
  • The same gated reweighting analysis may apply to recursive training in other generative domains that rely on model self-scoring for data filtering.

Load-bearing premise

Benchmark correctness remains a stable external measure of capability even after the model has been retrained on its own accepted outputs.

What would settle it

Perform multiple rounds of recursive fine-tuning on a code LLM using its binary self-scoring as the sole gate and measure whether average acceptance scores increase while held-out benchmark correctness decreases.
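The experiment described above can be sketched as a small harness. Everything here is hypothetical scaffolding: `generate`, `self_accept`, `fine_tune`, and `evaluate` are placeholders the reader would supply (sampling, the binary self-score, a fine-tuning step, and a held-out benchmark), and the rubber-stamp check is a deliberately crude first-versus-last comparison.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Sample = TypeVar("Sample")


def run_self_gated_rounds(
    model: Model,
    rounds: int,
    generate: Callable[[Model], Sequence[Sample]],
    self_accept: Callable[[Model, Sample], bool],
    fine_tune: Callable[[Model, List[Sample]], Model],
    evaluate: Callable[[Model], float],
) -> List[Tuple[float, float]]:
    """Record (acceptance rate, held-out correctness) for each round."""
    history: List[Tuple[float, float]] = []
    for _ in range(rounds):
        samples = generate(model)
        # The current model scores its own outputs; this is the sole gate.
        accepted = [s for s in samples if self_accept(model, s)]
        acceptance_rate = len(accepted) / len(samples) if samples else 0.0
        model = fine_tune(model, accepted)
        # Correctness is measured on the retrained model, outside the loop's data.
        history.append((acceptance_rate, evaluate(model)))
    return history


def shows_rubber_stamp(history: Sequence[Tuple[float, float]]) -> bool:
    """True when acceptance ends higher and correctness ends lower than round 1."""
    if len(history) < 2:
        return False
    first_acceptance, first_correctness = history[0]
    last_acceptance, last_correctness = history[-1]
    return last_acceptance > first_acceptance and last_correctness < first_correctness
```

A real run would also want per-round confidence intervals and several seeds before reading anything into the trend.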

Figures

Figures reproduced from arXiv: 2606.28438 by Liang Zhao, Xinyuan Song, Zekun Cai.

Figure 1: Recursive self-training without vs. with gating. Left: ungated training reuses all generated code and can amplify errors. Right: gated training filters generated code through r(x, c); Human gates are exogenous, while AI self-gates are coupled to θ_t.
Adjacent text from the paper: The gated synthetic data distribution is therefore m_t^gated(x, c) := p_X(x) q_{θ_t}(c | x) (Eq. 2.5). Training on accepted samples yields θ_{t+1}^gated ∈ arg max_θ E(x,…
Figure 2: Recursive self-training collapse overview: HumanEval+ and MBPP+ across gate types and models. Each panel shows all four model trajectories from Round 0 (pre-training baseline) to Round 5. Left column: Human gate (Compile + Quality average); right column: AI self-gate (Binary + PPL average). Top row: HumanEval+; bottom row: MBPP+. Dashed horizontal lines denote model-specific baselines. All models degrade s…
Figure 3
Figure 4: HumanEval+ pass@1 heatmap over recursive rounds. Rows are model–filter pairs and columns are rounds 1–5. Brighter cells indicate higher pass@1, and annotations give exact values. Horizontal white lines separate model families.
Figure 5: Recursive self-training results on SantaCoder. HumanEval pass@1 (left) and MBPP pass@1 (right) across recursive fine-tuning rounds. Human-gate methods are Compile and Quality; AI-self-gate methods are Perplexity and Binary Classifier. The red dashed line denotes the SantaCoder pretraining baseline.
Figure 6: Filtering strategies on SantaCoder: early vs. extended recursive training. Blue bars show pass@1 after round 1; orange bars show the extended 30k-step checkpoint where available. Background shading: white = no gate; green = Human gate; light blue = AI self-gate.
Figure 7: Filter pass rate over rounds, averaged across all four models. No-gate and Human-gate regimes (Vanilla, Compile, Quality) maintain stable pass rates by construction or by fixed rules. AI self-gate methods (Binary, PPL) show increasing pass rates, indicating progressive loss of discriminative power: the rubber-stamp regime of Theorem 2.3.
Figure 8
Figure 9: LiveCodeBench collapse under Vanilla self-training. All four models approach near-zero pass@1 by Round 5. Qwen2.5-Coder collapses from 0.238 to 0.000; Code Llama-7B collapses from 0.045 to 0.003. Y-axis: [0, 0.30].
Figure 10: Cross-model HumanEval (a) and MBPP (b) collapse under vanilla recursive self-retraining. All model families degrade monotonically from the pre-retraining baseline (step 0). Collapse is a systematic property of self-consuming retraining, independent of architecture.
Figure 11: Per-strategy collapse trajectories across models. HumanEval and MBPP pass@1 over recursive self-training rounds for StarCoder2-3B (top row) and Qwen2.5-Coder-1.5B (bottom row) under all filtering strategies. Dashed red line indicates each model’s pre-training baseline. All strategies converge well below the pre-training baseline by round 5.
Figure 12: Compile and execution-pass analysis across recursive retraining rounds. Results are computed on 500 generated samples with a 5s timeout. The figure compares ungated vanilla recursion with the compile-based Human gate.
Figure 13: SantaCoder per-strategy degradation trajectories. Each panel shows one filtering strategy. HumanEval pass@1 is shown with blue circles, and MBPP pass@1 is shown with orange squares. Dotted horizontal lines denote the pre-retraining baseline for each benchmark. The panels include Vanilla, Compile, Compile+Quality, Quality-20R, Perplexity, and Binary Classifier.
Figure 14: Appendix cross-model per-strategy degradation trajectories. HumanEval and MBPP pass@1 are reported over 10 self-training rounds for StarCoder2-3B and Qwen2.5-Coder-1.5B. The compared strategies are Vanilla, Compile, Perplexity, and Quality. Dashed red lines denote each model’s pretraining baseline.
Figure 15: Cross-model degradation trajectories for gated and extended vanilla runs. The top row reports compile-gated runs for Qwen2.5-Coder-1.5B and StarCoder2-3B. The bottom row reports extended vanilla runs for StarCoder and Qwen2.5-Coder-1.5B. Blue curves denote HumanEval, orange curves denote MBPP, and dotted lines denote pre-retraining baselines.
Figure 16: Appendix cross-model vanilla collapse on HumanEval and MBPP. The left panel reports HumanEval pass@1 and the right panel reports MBPP pass@1. All model families degrade substantially from the pre-retraining baseline at step 0.
Original abstract

Recursive self-training can degrade neural generative models when generated data is reused without fresh human data or external quality control. We study this risk in code LLMs, where AI-generated code can enter real repositories, later become training data, and create a repository-scale self-training loop. While software development traditionally interrupts this loop through pull-request review, tests, compilation, and human approval, AI coding tools now produce code faster than humans can review it, and code review itself is increasingly automated by AI systems. We therefore compare three recursive fine-tuning regimes: no review, Human-gate review using model-independent filters such as compilation and static quality checks, and AI-self-gate review using the code LLM's own signals such as perplexity and binary self-scoring. Across multiple code LLMs and benchmarks, no review collapses fastest, Human-gate filters slow but do not stop collapse, and AI-self-gate filters can look strong early but later lose their filtering effect. In the clearest case, the binary self-gate enters a rubber-stamp regime where acceptance scores rise while benchmark correctness falls. We explain this behavior by formulating review as gated distributional reweighting, proving that AI self-gating degenerates to ungated self-training under a self-confirming acceptance condition, and giving a spectral analysis of representation-level covariance concentration under recursive retraining. These results suggest that stable recursive code LLM training requires exogenous verification rather than model-coupled self-review.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines recursive self-training collapse in code LLMs by comparing three fine-tuning regimes—no review, Human-gate review (model-independent filters such as compilation and static checks), and AI-self-gate review (using the LLM's own signals like perplexity and binary self-scoring)—across multiple models and benchmarks. It reports that no review collapses fastest, Human-gate slows but does not stop collapse, and AI-self-gate initially appears effective but later enters a rubber-stamp regime with rising acceptance scores and falling benchmark correctness. The behavior is explained by formulating review as gated distributional reweighting, proving degeneration to ungated self-training under a self-confirming acceptance condition, and providing a spectral analysis of representation-level covariance concentration under recursive retraining. The conclusion advocates exogenous verification over model-coupled self-review.

Significance. If the empirical patterns and the degeneration proof hold, the work is significant for identifying a concrete failure mode in automated code review loops that can contaminate real repositories. The gated reweighting formulation and the explicit self-confirming condition provide a reusable theoretical lens, while the spectral analysis adds insight into representation collapse; together they move the discussion beyond purely empirical warnings about self-training.

major comments (2)
  1. [gated distributional reweighting / proof section] Proof of degeneration (gated distributional reweighting section): the self-confirming acceptance condition is presented as sufficient for collapse to ungated self-training, but the manuscript must verify this condition on held-out data separate from the recursive training distribution; if the condition is only checked on the same generated samples used to demonstrate rising acceptance and falling correctness, the argument risks circularity.
  2. [experiments / results on AI-self-gate] Experiments across models and benchmarks: the central observation that benchmark correctness falls while acceptance rises treats correctness as a stable external signal. The paper should report explicit controls (e.g., overlap statistics between generated code patterns and benchmark test cases, or evaluation on a temporally held-out benchmark version) to rule out the possibility that retraining shifts the evaluation distribution or contaminates the benchmarks themselves.
minor comments (2)
  1. [formulation section] Notation for the binary self-gate acceptance probability should be introduced once with a clear definition before its repeated use in the reweighting equations.
  2. [figures] Figure captions for the acceptance-vs-correctness plots should state the exact number of recursive steps and the precise definition of 'acceptance score' used in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. Below we address each major comment point by point, indicating the changes we will make in revision.

Point-by-point responses
  1. Referee: [gated distributional reweighting / proof section] Proof of degeneration (gated distributional reweighting section): the self-confirming acceptance condition is presented as sufficient for collapse to ungated self-training, but the manuscript must verify this condition on held-out data separate from the recursive training distribution; if the condition is only checked on the same generated samples used to demonstrate rising acceptance and falling correctness, the argument risks circularity.

    Authors: We agree that the current presentation risks circularity if the self-confirming acceptance condition is verified exclusively on the same generated samples used to illustrate rising acceptance and falling correctness. In the revised manuscript we will add an explicit verification of the condition on a held-out collection of generated samples that were never used in any recursive training iteration. This will be reported alongside the existing proof to eliminate the circularity concern. revision: yes

  2. Referee: [experiments / results on AI-self-gate] Experiments across models and benchmarks: the central observation that benchmark correctness falls while acceptance rises treats correctness as a stable external signal. The paper should report explicit controls (e.g., overlap statistics between generated code patterns and benchmark test cases, or evaluation on a temporally held-out benchmark version) to rule out the possibility that retraining shifts the evaluation distribution or contaminates the benchmarks themselves.

    Authors: We accept that additional controls are required to confirm that the observed drop in benchmark correctness is not an artifact of distribution shift or benchmark contamination. In revision we will compute and report overlap statistics (token-level and AST-level) between the generated code patterns and the benchmark test cases. We will also add results on any temporally held-out benchmark versions that are available for the models studied. These controls will be placed in the experimental results section. revision: yes
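The two overlap statistics named in this response could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: it uses Jaccard similarity, treats token overlap as overlap of distinct token strings, approximates AST-level overlap by shared parent-child node-type pairs, and assumes both inputs are syntactically valid Python. All function names are hypothetical.

```python
import ast
import io
import tokenize
from typing import Set, Tuple

_KEPT_TOKEN_TYPES = (tokenize.NAME, tokenize.NUMBER, tokenize.STRING, tokenize.OP)


def token_set(source: str) -> Set[str]:
    """Distinct lexical tokens, ignoring whitespace, newlines, and comments."""
    reader = io.StringIO(source).readline
    return {
        tok.string
        for tok in tokenize.generate_tokens(reader)
        if tok.type in _KEPT_TOKEN_TYPES
    }


def ast_edge_set(source: str) -> Set[Tuple[str, str]]:
    """Parent-child node-type pairs; identifier names do not affect the result."""
    edges = set()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.add((type(parent).__name__, type(child).__name__))
    return edges


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_stats(generated: str, benchmark: str) -> Tuple[float, float]:
    """Return (token-level overlap, AST-level overlap) for one pair of snippets."""
    return (
        jaccard(token_set(generated), token_set(benchmark)),
        jaccard(ast_edge_set(generated), ast_edge_set(benchmark)),
    )
```

Two functions that differ only in identifier names score below 1 at the token level but exactly 1 at the AST level, which is why reporting both is informative.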

Circularity Check

0 steps flagged

No circularity: derivation relies on independent mathematical formulation and external benchmarks

full rationale

The paper formulates review as gated distributional reweighting and proves degeneration to ungated self-training under a stated self-confirming acceptance condition, presenting this as a general result rather than a fit to the observed data. No quoted equations reduce a prediction to a fitted parameter by construction, no self-citations bear the central load, and no ansatz or uniqueness theorem is imported from prior author work. Benchmark correctness is treated as an external signal throughout, with the collapse observed across multiple models and regimes; the derivation chain remains self-contained against those measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the gated reweighting model and self-confirming condition are introduced as analytical tools rather than fitted quantities.

pith-pipeline@v0.9.1-grok · 5786 in / 1224 out tokens · 25760 ms · 2026-06-30T01:27:29.362285+00:00 · methodology


Reference graph

Works this paper leans on

168 extracted references · 76 canonical work pages · 36 internal anchors

  1. [1]

    Advances in neural information processing systems , volume=

    Optimal brain damage , author=. Advances in neural information processing systems , volume=

  2. [2]

    Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

    Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding , author=. arXiv preprint arXiv:1510.00149 , year=

  3. [3]

    The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

    The Impact of AI on Developer Productivity: Evidence from GitHub Copilot , author =. arXiv preprint arXiv:2302.06590 , year =

  4. [4]

    2024 , howpublished =

    Research: Quantifying GitHub Copilot's Impact in the Enterprise with Accenture , author =. 2024 , howpublished =

  5. [5]

    2025 , eprint=

    Rethinking Code Review Workflows with LLM Assistance: An Empirical Study , author=. 2025 , eprint=

  6. [6]

    Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions

    Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions , author =. arXiv preprint arXiv:2508.18771 , year =. 2508.18771 , archivePrefix=

  7. [7]

    2026 , howpublished =

  8. [8]

    2026 , howpublished =

    60 Million Copilot Code Reviews and Counting , author =. 2026 , howpublished =

  9. [9]

    arXiv preprint arXiv:2603.26130 , year =

    SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback , author =. arXiv preprint arXiv:2603.26130 , year =

  10. [10]

    Proceedings of the 35th International Conference on Software Engineering , pages =

    Expectations, Outcomes, and Challenges of Modern Code Review , author =. Proceedings of the 35th International Conference on Software Engineering , pages =. 2013 , doi =

  11. [11]

    2015 IEEE/ACM 12th Working Conference on Mining Software Repositories , pages =

    Characteristics of Useful Code Reviews: An Empirical Study at Microsoft , author =. 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories , pages =. 2015 , doi =

  12. [12]

    Empirical Software Engineering , volume =

    An Empirical Study of the Impact of Modern Code Review Practices on Software Quality , author =. Empirical Software Engineering , volume =. 2016 , doi =

  13. [13]

    Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice , pages =

    Modern Code Review: A Case Study at Google , author =. Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice , pages =. 2018 , doi =

  14. [14]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Toolformer: Language Models Can Teach Themselves to Use Tools , author =. 2023 , journal =. doi:10.48550/arXiv.2302.04761 , eprint =

  15. [15]

    PAL: Program-aided Language Models

    PAL: Program-aided Language Models , author =. 2022 , journal =. doi:10.48550/arXiv.2211.10435 , eprint =

  16. [16]

    ReAct: Synergizing Reasoning and Acting in Language Models

    ReAct: Synergizing Reasoning and Acting in Language Models , author =. 2023 , journal =. doi:10.48550/arXiv.2210.03629 , eprint =

  17. [17]

    2024 , eprint=

    The Curse of Recursion: Training on Generated Data Makes Models Forget , author=. 2024 , eprint=

  18. [18]

    Grammar-Constrained Decoding Makes Large Language Models Better Logical Parsers

    Raspanti, Federico and Ozcelebi, Tanir and Holenderski, Mike. Grammar-Constrained Decoding Makes Large Language Models Better Logical Parsers. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 2025. doi:10.18653/v1/2025.acl-industry.34

  19. [19]

    2025 , eprint=

    Beyond ReAct: A Planner-Centric Framework for Complex Tool-Augmented LLM Reasoning , author=. 2025 , eprint=

  20. [20]

    2021 , howpublished =

    openai/human-eval: Code for the paper ``Evaluating Large Language Models Trained on Code'' , author =. 2021 , howpublished =

  21. [21]

    2022 , howpublished =

    bigcode-project/bigcode-evaluation-harness: A framework for the evaluation of code generation models , author =. 2022 , howpublished =

  22. [22]

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis , author =. arXiv preprint arXiv:2009.10297 , year =. 2009.10297 , archivePrefix=

  23. [23]

    BLEU : a method for automatic evaluation of machine translation

    Bleu: a Method for Automatic Evaluation of Machine Translation , author =. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages =. 2002 , address =. doi:10.3115/1073083.1073135 , url =

  24. [24]

    Text Summarization Branches Out , pages =

    ROUGE: A Package for Automatic Evaluation of Summaries , author =. Text Summarization Branches Out , pages =. 2004 , address =

  25. [25]

    Popovi. chr. Proceedings of the Tenth Workshop on Statistical Machine Translation , pages =. 2015 , address =. doi:10.18653/v1/W15-3049 , url =

  26. [26]

    Measuring Coding Challenge Competence With

    Hendrycks, Dan and Basart, Steven and Kadavath, Saurav and Mazeika, Mantas and Arora, Akul and Guo, Ethan and Burns, Collin and Puranik, Samir and He, Horace and Song, Dawn and Steinhardt, Jacob , journal =. Measuring Coding Challenge Competence With. 2021 , eprint =

  27. [27]

    Fast Transformer Decoding: One Write-Head is All You Need

    Fast Transformer Decoding: One Write-Head is All You Need , author =. arXiv preprint arXiv:1911.02150 , year =. 1911.02150 , archivePrefix=

  28. [28]

    Wang, Jiexin and Luo, Xitong and Cao, Liuwen and He, Hongkui and Huang, Hailin and Xie, Jiayuan and Jatowt, Adam and Cai, Yi , journal =. Is Your. 2024 , eprint =

  29. [29]

    Program Synthesis with Large Language Models

    Program Synthesis with Large Language Models , author=. arXiv preprint arXiv:2108.07732 , year=

  30. [30]

    Liu and Balaji Lakshminarayanan , title =

    Jie Ren and Yao Zhao and Tu Vu and Peter J. Liu and Balaji Lakshminarayanan , title =. Proceedings of the Workshop on I Can't Believe It's Not Better at NeurIPS , series =. 2023 , publisher =

  31. [31]

    The Twelfth International Conference on Learning Representations , year =

    Aman Madaan and Niket Tandon and Prakhar Gupta and Skyler Hallinan and Luyu Gao and Sarah Wiegreffe and Uri Alon and Nouha Dziri and Shrimai Prabhumoye and Yiming Yang and Shashank Gupta and Bodhisattwa Prasad Majumder and Katherine Hermann and Sean Welleck and Amir Yazdanbakhsh and Peter Clark , title =. The Twelfth International Conference on Learning R...

  32. [32]

    Advances in Neural Information Processing Systems , volume =

    Noah Shinn and Federico Cassano and Ashwin Gopinath and Karthik Narasimhan and Shunyu Yao , title =. Advances in Neural Information Processing Systems , volume =. 2023 , url =

  33. [33]

    Xing and Hao Zhang and Joseph E

    Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , title =. NeurIPS 2023 Datasets and Benchmarks Track , year =

  34. [34]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

    Helia Hashemi and Jason Eisner and Corby Rosset and Benjamin Van Durme and Chris Kedzie , title =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2024 , publisher =. doi:10.18653/v1/2024.acl-long.745 , url =

  35. [35]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =

    Weixi Tong and Tianyi Zhang , title =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =. 2024 , publisher =. doi:10.18653/v1/2024.emnlp-main.1118 , url =

  36. [36]

    Proceedings of the 31st International Conference on Computational Linguistics , pages =

    Yuwei Zhao and Ziyang Luo and Yuchen Tian and Hongzhan Lin and Weixiang Yan and Annan Li and Jing Ma , title =. Proceedings of the 31st International Conference on Computational Linguistics , pages =. 2025 , publisher =

  37. [37]

    Bowman and Shi Feng , title =

    Arjun Panickssery and Samuel R. Bowman and Shi Feng , title =. Advances in Neural Information Processing Systems , year =

  38. [38]

    arXiv preprint arXiv:2301.03988 , year=

    SantaCoder: don't reach for the stars! , author=. arXiv preprint arXiv:2301.03988 , year=

  39. [39]

    Transactions on Machine Learning Research , year=

    StarCoder: may the source be with you! , author=. Transactions on Machine Learning Research , year=

  40. [40]

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence , author=. arXiv preprint arXiv:2406.11931 , year=

  41. [41]

    Qwen2.5-Coder Technical Report

    Qwen2.5-Coder Technical Report , author=. arXiv preprint arXiv:2409.12186 , year=

  42. [42]

    International Conference on Learning Representations , year=

    OctoPack: Instruction Tuning Code Large Language Models , author=. International Conference on Learning Representations , year=

  43. [43]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Volume 4: Student Research Workshop , year=

    InstructCoder: Instruction Tuning Large Language Models for Code Editing , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Volume 4: Student Research Workshop , year=

  44. [44]

    Evaluation Best Practices , year =

  45. [45]

    2026 , howpublished =

    Demystifying evals for. 2026 , howpublished =

  46. [46]

    Evaluating Large Language Models Trained on Code

    Evaluating Large Language Models Trained on Code , author=. arXiv preprint arXiv:2107.03374 , year=

  47. [47]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code , author=. arXiv preprint arXiv:2403.07974 , year=

  48. [48]

    arXiv preprint arXiv:2411.04905 , year =

    OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models , author =. arXiv preprint arXiv:2411.04905 , year =. 2411.04905 , archivePrefix=

  49. [49]

    arXiv preprint arXiv:2510.16579 , year =

    Human-Aligned Code Readability Assessment with Large Language Models , author =. arXiv preprint arXiv:2510.16579 , year =. 2510.16579 , archivePrefix=

  50. [50]

    Rozi. Code. arXiv preprint arXiv:2308.12950 , year =. 2308.12950 , archivePrefix=

  51. [51]

    arXiv preprint arXiv:2601.21894 , year =

    Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning , author =. arXiv preprint arXiv:2601.21894 , year =. 2601.21894 , archivePrefix=

  52. [52]

    StarCoder 2 and The Stack v2: The Next Generation

    StarCoder 2 and The Stack v2: The Next Generation , author =. arXiv preprint arXiv:2402.19173 , year =. 2402.19173 , archivePrefix=

  53. [53]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence , author =. arXiv preprint arXiv:2401.14196 , year =. 2401.14196 , archivePrefix=

  54. [54]

    Efficient Training of Language Models to Fill in the Middle

    Efficient Training of Language Models to Fill in the Middle , author =. arXiv preprint arXiv:2207.14255 , year =. 2207.14255 , archivePrefix=

  55. [55]

    and Guha, Arjun and Greenberg, Michael and Jangda, Abhinav , journal =

    Cassano, Federico and Gouwar, John and Nguyen, Daniel and Nguyen, Sydney and Phipps-Costin, Luna and Pinckney, Donald and Yee, Ming-Ho and Zi, Yangtian and Anderson, Carolyn Jane and Feldman, Molly Q. and Guha, Arjun and Greenberg, Michael and Jangda, Abhinav , journal =. 2022 , eprint =

  56. [56]

    Kocetkov, Denis and Li, Raymond and Ben Allal, Loubna and Liu, Jia and Mathur, Neel and Muennighoff, Niklas and Ogueji, Kelechi and Mishra, Shreshtha and Sharma, Shubham and Tunstall, Lewis and von Werra, Leandro and Wolf, Thomas , journal =. The. 2022 , eprint =

  57. [57]

    2022 , eprint =

    Fried, Daniel and Aghajanyan, Armen and Lin, Jessy and Wang, Sida and Wallace, Eric and Shi, Freda and Zhong, Ruiqi and Yih, Wen-tau and Zettlemoyer, Luke and Lewis, Mike , journal =. 2022 , eprint =

  58. [58]

    2022 , eprint =

    Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming , journal =. 2022 , eprint =

  59. [59]

    Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

    ReCode: Robustness Evaluation of Code Generation Models , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2023 , month = jul, publisher =

  60. [60]

    Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review. arXiv preprint arXiv:2406.12655.

  61. [61]

    SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models. 2024.

  62. [62]

    Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks. 2025.

  63. [63]

    Matrix Analysis. 2012.

  64. [64]

    Matrix Analysis. 1997.

  65. [65]

    Matrix Computations. 2013.

  66. [66]

    Self-Consuming Generative Models Go MAD. 2023.

  67. [67]

    Flexible and Efficient Grammar-Constrained Decoding. Proceedings of the 42nd International Conference on Machine Learning (ICML).

  68. [68]

    Let's Verify Step by Step. 2023. doi:10.48550/arXiv.2305.20050.

  69. [69]

    Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774.

  70. [70]

    Domain specialization as the key to make large language models disruptive: A comprehensive survey. arXiv preprint arXiv:2305.18703.

  71. [71]

    A Simple and Effective Pruning Approach for Large Language Models. arXiv preprint arXiv:2306.11695.

  72. [72]

    LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv preprint arXiv:2305.11627.

  73. [73]

    LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. arXiv preprint arXiv:2306.11222.

  74. [74]

    Structured Pruning for Efficient Generative Pre-trained Language Models. Findings of the Association for Computational Linguistics: ACL 2023.

  75. [75]

    ZipLM: Hardware-aware structured pruning of language models. arXiv preprint arXiv:2302.04089.

  76. [76]

    AccelTran: A sparsity-aware accelerator for dynamic inference with transformers. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

  77. [77]

    Deja vu: Contextual sparsity for efficient LLMs at inference time. International Conference on Machine Learning, 2023.

  78. [78]

    Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends, 2011.

  79. [79]

    A systematic DNN weight pruning framework using alternating direction method of multipliers. Proceedings of the European Conference on Computer Vision (ECCV).

  80. [80]

    StructADMM: Achieving ultrahigh efficiency in structured pruning for DNNs. IEEE Transactions on Neural Networks and Learning Systems, 2021.

Showing first 80 references.