Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

Mehmet Iscan

arxiv: 2606.31511 · v1 · pith:CIAJWEWCnew · submitted 2026-06-30 · 💻 cs.SE · cs.CL· cs.LG

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

Mehmet Iscan This is my paper

Pith reviewed 2026-07-01 04:41 UTC · model grok-4.3

classification 💻 cs.SE cs.CLcs.LG

keywords self-repaircode generationfalsificationplacebo controlfrozen modelsexecutable counterexamplespreregistered experimentfeedback decomposition

0 comments

The pith

Feedback in frozen code models repairs programs by enabling comparison to executable counterexamples rather than by re-exposure to failing code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether self-repair feedback in small frozen code models gains its value from seeing the failing code again or from opening a new program conjecture to external executable criticism. It builds a preregistered placebo-controlled instrument that decomposes feedback packets against blind resampling at matched generation budget and against content-free shape-matched placebos. Experiments across HumanEval+ and MBPP+ cells show blind resampling and code augmented with facts both outperform bare-code retry and generic placebos, while instruction-only feedback does not. A sympathetic reader would care because the results point to a concrete mechanism that can be isolated and tested in deployment settings where models cannot be retrained.

Core claim

Across six HumanEval+/MBPP+ cells with three 0.5B-1.5B frozen models, 290 dead task-cell units were evaluated with 7,000 fresh generations in the main run and 1,400 more in a preregistered follow-up. Blind resampling exceeded bare-code retry by +18 net unlocks. Code-plus-facts recovered +18 over bare code and +15 over a generic-bullet placebo. An instruction-only effect was not distinguishable. Code-plus-facts and blind resampling tied at 26 unlocks each. Six external-controller follow-ups tied a content-free shape placebo. In this regime, falsification helped not as vocabulary or self-critique, but as comparison with external, executable counterexamples. The contribution is a reflexive meth

What carries the argument

The placebo-controlled instrument that decomposes the feedback packet by contrasting its elements against a blind-resampling baseline at matched output-generation budget and content-free shape-matched placebos.

If this is right

Blind resampling at matched budget produces more net repairs than retrying with the original failing code alone.
Augmenting code with facts from test executions outperforms both bare code and generic-bullet placebos.
Instruction-only feedback does not yield a distinguishable improvement over placebos.
The decomposition method allows any feedback-content claim to be audited through fresh generations and external execution checks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the advantage requires executable test results rather than any descriptive text, then feedback prompts should be redesigned to surface verifiable counterexamples first.
The same decomposition could be applied to non-code tasks such as mathematical derivations to test whether external criticism generalizes beyond program execution.
Scaling the instrument to models above 1.5B parameters would show whether the falsification effect remains dominant or yields to other mechanisms.
Repeating the discordant-pair tests on a different benchmark set would confirm whether the +18 unlock margin holds outside the six cells examined.

Load-bearing premise

The assumption that the placebo controls and blind resampling are truly content-free or matched in a way that isolates the falsification effect without introducing other variables like generation quality differences.

What would settle it

A follow-up run that keeps the same generation budget and shape but replaces the executable counterexamples with non-executable descriptive text, and finds that the net unlock advantage over bare code disappears.

Figures

Figures reproduced from arXiv: 2606.31511 by Mehmet Iscan.

**Figure 1.** Figure 1: Falsification-centered measurement instrument. Each dead unit for which no candidate in the cached pool passed the public tier was evaluated under five regeneration arms with the same output-generation budget. Four rounds were run for each arm, one candidate was generated in each round, and a fresh seed from a dedicated namespace was used. The mirror chain links the CF-SF packet to the TC trajectory and th… view at source ↗

**Figure 2.** Figure 2: Per-cell net discordant effects for the three family claims across the six benchmark×model cells. The dashed line denotes zero. C1′ and C2′ are in the positive direction in all six cells. C3′ is positive in four of six cells, with a worst cell of −1. based on 5,800 generations from the fixed sample. The remaining 1,200 generations were produced for the disjoint sensitivity sample. An additional 1,400 fresh… view at source ↗

**Figure 3.** Figure 3: Discovery and fresh-generation confirmation. Descriptive contrasts observed with discovery seeds and preregistered estimates obtained with 100% fresh confirmation seeds are shown on the same n = 290 units. Net effects were reproduced as +18 for both contrasts, although discordant compositions changed. six cells. The preregistered cell guard was defined for C1′ and C2′ in each benchmark×model cell by the fo… view at source ↗

**Figure 4.** Figure 4: Pooled true unlocks across 290 dead units. Blind resampling (B) and code-plus-facts (CF-SF) tied with 26 unlocks. Facts only (TC) produced 15, placebo (GR-SHAPE) produced 11, and bare code (P) produced 8 unlocks [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 5.** Figure 5: Observed comparison with blind resampling. Per-cell true-unlock counts for CF-SF and B are heterogeneous around the identity line. Pooled over all cells, the two arms tied with 26 unlocks and produced a symmetric 20/20 discordant split. This result does not imply equivalence. B-only directions ( [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

read the original abstract

In deployment settings where retraining is infeasible, small frozen code models are routinely asked to repair a failed program after seeing their own failing output, usually treated as a retry mechanism. From a Popperian view, a generated program is a conjecture and a test-execution violation is an oracle-relative, executable counterexample, so feedback's value should be attributed not to re-exposure to failing code but to whether the conjecture is opened to external, executable criticism. As the third stage of a falsification-centered measurement program, this study builds a placebo-controlled instrument that decomposes the feedback packet against a blind-resampling baseline at matched output-generation budget and against content-free, shape-matched placebos. The contribution is not a new repair algorithm but a reflexive methodology (packet decomposition, placebo mirroring, matched-budget discordant-pair tests, fresh-generation confirmation, executable audits) that makes both the model's program conjecture and the researcher's "feedback content works" claim falsifiable. Across six HumanEval+/MBPP+ cells with three 0.5B-1.5B frozen models, 290 dead task-cell units (no best-of-8 candidate passing the public tier) were evaluated; the main run produced 7,000 fresh generations and a preregistered follow-up 1,400 more. Blind resampling exceeded bare-code retry by +18 net unlocks (25/7, Holm p=0.0021). Code-plus-facts recovered +18 over bare code (21/3, p=0.00042) and +15 over a generic-bullet placebo (p=0.0041). An instruction-only effect was not distinguishable (+3, p=0.36). Code-plus-facts and blind resampling tied at 26 unlocks each (not equivalence). Six external-controller follow-ups tied a content-free shape placebo. In this regime, falsification helped not as vocabulary or self-critique, but as comparison with external, executable counterexamples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's core claim is that external executable counterexamples explain the repair gains in these small frozen models, supported by a preregistered placebo decomposition that beats typical self-repair setups on controls.

read the letter

The punchline is that feedback helps here mainly through comparison to external, executable counterexamples rather than re-exposure to failing code or self-critique. Blind resampling beat bare retry by +18 net unlocks, and code-plus-facts beat a generic-bullet placebo by +15, with the stats holding after Holm correction.

What the paper does well is the reflexive methodology. It decomposes the feedback packet with blind resampling at matched output budget, shape-matched placebos, discordant-pair tests, and fresh-generation confirmation. Preregistration plus external audits and the use of HumanEval+/MBPP+ on 290 dead cells across three small models give it more structure than most self-repair studies. The explicit goal of making both the model conjecture and the research claim falsifiable is a clear step forward.

The soft spot is the placebo isolation. The stress-test concern is fair: if the generic-bullet placebo or blind resampling changes generation quality or diversity through prompt structure alone, the attribution to external counterexamples weakens. The abstract states the controls are content-free and matched but does not report post-hoc generation statistics, which matters for 0.5B-1.5B models where small prompt shifts can matter. Results are also scoped to this dead-cell regime.

This is for people working on evaluation methods in AI code generation who want tighter controls on feedback claims. It deserves peer review because the design is novel and the numbers are reported with enough transparency to let referees check the controls.

Referee Report

2 major / 2 minor

Summary. The paper claims that feedback benefits in self-repair for small frozen code models (0.5B–1.5B) arise specifically from comparison against external, executable counterexamples rather than re-exposure to failing code, vocabulary, or self-critique. It introduces a reflexive, placebo-controlled decomposition instrument using blind resampling at matched output-generation budget and content-free shape-matched placebos on 290 dead HumanEval+/MBPP+ task-cells, reporting statistically significant net unlocks (blind resampling +18 over bare retry, Holm p=0.0021; code-plus-facts +15 over generic-bullet placebo, p=0.0041) across 7000+ fresh generations plus a preregistered follow-up, with external audits and no distinguishable instruction-only effect.

Significance. If the decomposition holds, the work supplies a falsifiable methodology (packet decomposition, placebo mirroring, matched-budget discordant-pair tests, executable audits) for attributing feedback effects in code repair, with explicit credit for preregistration, external audits, fresh-generation confirmation, and parameter-free statistical framing. It reframes self-repair as Popperian conjecture-criticism rather than retry, which could inform deployment practices where retraining is infeasible.

major comments (2)

[Abstract] Abstract (controls and instrument description): The central claim that gains are due to 'comparison with external, executable counterexamples' rather than prompt-structure confounds rests on blind resampling and the generic-bullet placebo being content-free and shape-matched. The manuscript states these controls are budget-matched and shape-matched but does not report post-hoc audits confirming equivalence on generation statistics such as output length or syntactic variety; this is load-bearing because any systematic difference would undermine attribution of the reported +18 net unlocks (25/7) and +15 advantage (p=0.0041).
[Abstract] Abstract (results paragraph): The statement that 'Code-plus-facts and blind resampling tied at 26 unlocks each (not equivalence)' is presented without a formal equivalence test or power analysis for the tie; given the small per-cell sample (290 dead units across six cells), this weakens the claim that the two conditions are interchangeable for isolating the falsification mechanism.

minor comments (2)

[Abstract] The term 'net unlocks' is used repeatedly but its exact definition (e.g., how ties or partial improvements are scored) is not restated in the abstract; a one-sentence operational definition would improve readability.
[Abstract] The six external-controller follow-ups are mentioned only in passing; a brief parenthetical on what 'tied a content-free shape placebo' means operationally would clarify the robustness check.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the abstract's control descriptions and statistical framing. We address each point directly below.

read point-by-point responses

Referee: [Abstract] Abstract (controls and instrument description): The central claim that gains are due to 'comparison with external, executable counterexamples' rather than prompt-structure confounds rests on blind resampling and the generic-bullet placebo being content-free and shape-matched. The manuscript states these controls are budget-matched and shape-matched but does not report post-hoc audits confirming equivalence on generation statistics such as output length or syntactic variety; this is load-bearing because any systematic difference would undermine attribution of the reported +18 net unlocks (25/7) and +15 advantage (p=0.0041).

Authors: We agree that post-hoc verification of generation statistics is warranted to confirm shape-matching. The revised manuscript will add explicit audits reporting mean/variance of output token lengths and syntactic variety (AST node counts and unique token-type diversity) for blind resampling versus the generic-bullet placebo, showing no material differences. These statistics were computed on the existing 7000+ generations but omitted for brevity; they will be inserted into the methods and results. revision: yes
Referee: [Abstract] Abstract (results paragraph): The statement that 'Code-plus-facts and blind resampling tied at 26 unlocks each (not equivalence)' is presented without a formal equivalence test or power analysis for the tie; given the small per-cell sample (290 dead units across six cells), this weakens the claim that the two conditions are interchangeable for isolating the falsification mechanism.

Authors: The parenthetical qualifier was added specifically to avoid any implication of equivalence. We will nevertheless strengthen the revision by including a post-hoc power calculation for detecting a difference of 5 unlocks (given observed variance and n=290) and a brief note that the data remain consistent with comparable performance. The primary claims rest on the significant contrasts versus bare retry and the generic placebo, not on interchangeability of the two high-performing conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from external benchmarks and preregistered controls

full rationale

The paper reports an empirical decomposition using HumanEval+/MBPP+ tasks, net-unlock counts, Holm-adjusted p-values, and matched-budget baselines. No equations, fitted parameters, or self-citations are invoked to derive the central claims; the +18 net unlocks and placebo comparisons are direct experimental outcomes rather than reductions of the inputs. The methodology is presented as falsifiable and reflexive, with no load-bearing self-referential definitions or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard statistical assumptions and domain assumptions about code generation tasks; no free parameters or invented entities introduced.

axioms (2)

standard math Statistical tests like Holm correction for multiple comparisons are valid for the reported p-values.
Used in reporting the significance of unlocks.
domain assumption The task cells with no best-of-8 passing are appropriate for testing repair feedback.
290 dead task-cell units selected based on this.

pith-pipeline@v0.9.1-grok · 5903 in / 1408 out tokens · 41920 ms · 2026-07-01T04:41:46.339380+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 48 canonical work pages · 11 internal anchors

[1]

Program Synthesis with Large Language Models

austin2021mbpp APACrefauthors Austin, J. , Odena, A. , Nye, M. , Bosma, M. , Michalewski, H. , Dohan, D. , Jiang, E. , Cai, C. , Terry, M. , Le, Q. \ Sutton, C. APACrefauthors \ 2021 . Program Synthesis with Large Language Models. Program synthesis with large language models. APACrefURL https://arxiv.org/abs/2108.07732 APACrefURL

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

brown2024monkeys APACrefauthors Brown, B. , Juravsky, J. , Ehrlich, R. , Clark, R. , Le, Q V. , R \'e , C. \ Mirhoseini, A. APACrefauthors \ 2024 . Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Large language monkeys: Scaling inference compute with repeated sampling. APACrefURL https://arxiv.org/abs/2407.21787 APACrefURL

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

\ Toubia, O

brucks2025prompt APACrefauthors Brucks, M. \ Toubia, O. APACrefauthors \ 2025 . Prompt Architecture Induces Methodological Artifacts in Large Language Models Prompt architecture induces methodological artifacts in large language models . PLOS ONE 20 4 e0319159 . APACrefDOI doi:10.1371/journal.pone.0319159 APACrefDOI

work page doi:10.1371/journal.pone.0319159 2025
[4]

, Tang, H

chen2024funcoder APACrefauthors Chen, J. , Tang, H. , Chu, Z. , Chen, Q. , Wang, Z. , Liu, M. \ Qin, B. APACrefauthors \ 2024 . Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation Divide-and-conquer meets consensus: Unleashing the power of functions in code generation . Advances in Neural Information Processing Systems...

work page arXiv 2024
[5]

Evaluating Large Language Models Trained on Code

chen2021humaneval APACrefauthors Chen, M. , Tworek, J. , Jun, H. , Yuan, Q. , de Oliveira Pinto, H P. , Kaplan, J. , Edwards, H. , Burda, Y. , Joseph, N. , Brockman, G. , Ray, A. , Puri, R. , Krueger, G. , Petrov, M. , Khlaaf, H. , Sastry, G. , Mishkin, P. , Chan, B. , Gray, S. , Ryder, N. , Pavlov, M. , Power, A. , Kaiser, L. , Bavarian, M. , Winter, C. ...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Teaching Large Language Models to Self-Debug

chen2024selfdebug APACrefauthors Chen, X. , Lin, M. , Sch\" a rli, N. \ Zhou, D. APACrefauthors \ 2024 . Teaching Large Language Models to Self-Debug Teaching large language models to self-debug . The Twelfth International Conference on Learning Representations. The twelfth international conference on learning representations. APACrefURL https://arxiv.org...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

, Kang, D

cho2025cocos APACrefauthors Cho, J. , Kang, D. , Kim, H. \ Lee, G G. APACrefauthors \ 2025 . Self-Correcting Code Generation Using Small Language Models Self-correcting code generation using small language models . Findings of the Association for Computational Linguistics: EMNLP 2025 Findings of the association for computational linguistics: Emnlp 2025 \ ...

work page doi:10.18653/v1/2025.findings-emnlp.127 2025
[8]

, Tennenholtz, G

chow2025inferenceaware APACrefauthors Chow, Y. , Tennenholtz, G. , Gur, I. , Zhuang, V. , Dai, B. , Thiagarajan, S. , Boutilier, C. , Agarwal, R. , Kumar, A. \ Faust, A. APACrefauthors \ 2025 . Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models Inference-aware fine-tuning for best-of-n sampling in large language models . Internati...

work page arXiv 2025
[9]

, Shenfeld, I

damani2024hard APACrefauthors Damani, M. , Shenfeld, I. , Peng, A. , Bobu, A. \ Andreas, J. APACrefauthors \ 2025 . Learning How Hard to Think: Input-Adaptive Allocation of LM Computation Learning how hard to think: Input-adaptive allocation of LM computation . International Conference on Learning Representations (ICLR). International conference on learni...

work page arXiv 2025
[10]

, Min, M J

ding2024cycle APACrefauthors Ding, Y. , Min, M J. , Kaiser, G. \ Ray, B. APACrefauthors \ 2024 . CYCLE : Learning to Self-Refine the Code Generation CYCLE : Learning to self-refine the code generation . Proceedings of the ACM on Programming Languages (OOPSLA). Proceedings of the acm on programming languages (oopsla). APACrefURL https://arxiv.org/abs/2403....

work page arXiv 2024
[11]

, Zhao, J

dinh2023bugs APACrefauthors Dinh, T. , Zhao, J. , Tan, S. , Negrinho, R. , Lausen, L. , Zha, S. \ Karypis, G. APACrefauthors \ 2023 . Large Language Models of Code Fail at Completing Code with Potential Bugs Large language models of code fail at completing code with potential bugs . Advances in Neural Information Processing Systems 36 (NeurIPS). Advances ...

work page arXiv 2023
[12]

, Lydersen, S

fagerland2013mcnemar APACrefauthors Fagerland, M W. , Lydersen, S. \ Laake, P. APACrefauthors \ 2013 . The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional . BMC Medical Research Methodology 13 1 91 . APA...

work page doi:10.1186/1471-2288-13-91 2013
[13]

\ Bedrick, S

gorman2019splits APACrefauthors Gorman, K. \ Bedrick, S. APACrefauthors \ 2019 . We Need to Talk about Standard Splits We need to talk about standard splits . Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) Proceedings of the 57th annual meeting of the association for computational linguistics (acl) \ ( \ 2786...

work page doi:10.18653/v1/p19-1267 2019
[14]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

guo2024deepseekcoder APACrefauthors Guo, D. , Zhu, Q. , Yang, D. , Xie, Z. , Dong, K. , Zhang, W. , Chen, G. , Bi, X. , Wu, Y. , Li, Y K. , Luo, F. , Xiong, Y. \ Liang, W. APACrefauthors \ 2024 . DeepSeek-Coder : When the Large Language Model Meets Programming -- The Rise of Code Intelligence. DeepSeek-Coder : When the large language model meets programmi...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

, Chatzimparmpas, A

hofman2023prereg APACrefauthors Hofman, J M. , Chatzimparmpas, A. , Sharma, A. , Watts, D J. \ Hullman, J. APACrefauthors \ 2023 . Pre-registration for Predictive Modeling. Pre-registration for predictive modeling. APACrefURL https://arxiv.org/abs/2311.18807 APACrefURL

work page arXiv 2023
[16]

APACrefauthors \ 1979

holm1979simple APACrefauthors Holm, S. APACrefauthors \ 1979 . A Simple Sequentially Rejective Multiple Test Procedure A simple sequentially rejective multiple test procedure . Scandinavian Journal of Statistics 6 2 65--70

1979
[17]

, Chen, X

huang2024cannot APACrefauthors Huang, J. , Chen, X. , Mishra, S. , Zheng, H S. , Yu, A W. , Song, X. \ Zhou, D. APACrefauthors \ 2024 . Large Language Models Cannot Self-Correct Reasoning Yet Large language models cannot self-correct reasoning yet . The Twelfth International Conference on Learning Representations. The twelfth international conference on l...

2024
[18]

Qwen2.5-Coder Technical Report

hui2024qwen25coder APACrefauthors Hui, B. , Yang, J. , Cui, Z. , Yang, J. , Liu, D. , Zhang, L. , Liu, T. , Zhang, J. , Yu, B. , Lu, K. , Dang, K. , Fan, Y. , Zhang, Y. , Yang, A. , Men, R. , Huang, F. , Zheng, B. , Miao, Y. , Quan, S. , Feng, Y. , Ren, X. , Ren, X. , Zhou, J. \ Lin, J. APACrefauthors \ 2024 . Qwen2.5-Coder Technical Report. Qwen2.5-Coder...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

iscan2026scaffold APACrefauthors \. I s can, M. APACrefauthors \ 2026 1 . Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill. Scaffold, not vocabulary? a controlled, two-tier, pre-registered study of a popperian code-generation skill. APACrefURL https://arxiv.org/abs/2606.06454 APACrefURL

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

I s can, M

iscan2026selection APACrefauthors \. I s can, M. APACrefauthors \ 2026 2 . Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models. Selection without signal, recovery through expression: A measurement study of post-hoc falsification operators for frozen small code models. ...

work page arXiv 2026
[21]

jiang2024ledex APACrefauthors Jiang, N. , Li, X. , Wang, S. , Zhou, Q. , Hossain, S B. , Ray, B. , Kumar, V. , Ma, X. \ Deoras, A. APACrefauthors \ 2024 . LeDex : Training LLMs to Better Self-Debug and Explain Code LeDex : Training LLMs to better self-debug and explain code . Advances in Neural Information Processing Systems 37 (NeurIPS). Advances in neur...

work page arXiv 2024
[22]

\ Chen, H

jin2025selfcritiquefail APACrefauthors Jin, H. \ Chen, H. APACrefauthors \ 2025 . Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications Uncovering systematic failures of LLMs in verifying code against natural language specifications . 2025 40th IEEE/ACM International Conference on Automated Software Engineering (...

work page doi:10.1109/ase63991.2025.00323 2025
[23]

\ Steinhardt, J

jones2022cognitivebiases APACrefauthors Jones, E. \ Steinhardt, J. APACrefauthors \ 2022 . Capturing Failures of Large Language Models via Human Cognitive Biases Capturing failures of large language models via human cognitive biases . Advances in Neural Information Processing Systems 35 (NeurIPS). Advances in neural information processing systems 35 (neur...

work page arXiv 2022
[24]

, de Oliveira Neto, F G

khojah2025promptprogramming APACrefauthors Khojah, R. , de Oliveira Neto, F G. , Mohamad, M. \ Leitner, P. APACrefauthors \ 2025 . The Impact of Prompt Programming on Function-Level Code Generation The impact of prompt programming on function-level code generation . IEEE Transactions on Software Engineering 51 8 2381--2401 . APACrefDOI doi:10.1109/TSE.202...

work page doi:10.1109/tse.2025.3587794 2025
[25]

, Wang, C

lam2025codecrash APACrefauthors Lam, M H. , Wang, C. , Huang, J t. \ Lyu, M R. APACrefauthors \ 2025 . CodeCrash : Stress Testing LLM Reasoning under Structural and Semantic Perturbations. CodeCrash : Stress testing LLM reasoning under structural and semantic perturbations. APACrefURL https://arxiv.org/abs/2504.14119 APACrefURL

work page arXiv 2025
[26]

, Katz, S

lavon2025egcfg APACrefauthors Lavon, B. , Katz, S. \ Wolf, L. APACrefauthors \ 2025 . Execution Guided Line-by-Line Code Generation. Execution guided line-by-line code generation. APACrefURL https://arxiv.org/abs/2506.10948 APACrefURL

work page arXiv 2025
[27]

, Xia, C S

liu2023evalplus APACrefauthors Liu, J. , Xia, C S. , Wang, Y. \ Zhang, L. APACrefauthors \ 2023 . Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation . Advances in Neural Informat...

2023
[28]

Self-Refine: Iterative Refinement with Self-Feedback

madaan2023selfrefine APACrefauthors Madaan, A. , Tandon, N. , Gupta, P. , Hallinan, S. , Gao, L. , Wiegreffe, S. , Alon, U. , Dziri, N. , Prabhumoye, S. , Yang, Y. , Gupta, S. , Majumder, B P. , Hermann, K. , Welleck, S. , Yazdanbakhsh, A. \ Clark, P. APACrefauthors \ 2023 . Self-Refine : Iterative Refinement with Self-Feedback Self-Refine : Iterative ref...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

\ Spanos, A

mayospanos2006severe APACrefauthors Mayo, D G. \ Spanos, A. APACrefauthors \ 2006 . Severe Testing as a Basic Concept in a Neyman--Pearson Philosophy of Induction Severe testing as a basic concept in a Neyman--Pearson philosophy of induction . The British Journal for the Philosophy of Science 57 2 323--357 . Advance Access published April 11, 2006 APACref...

work page doi:10.1093/bjps/axl003 2006
[30]

APACrefauthors \ 1947

mcnemar1947note APACrefauthors McNemar, Q. APACrefauthors \ 1947 . Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages Note on the sampling error of the difference between correlated proportions or percentages . Psychometrika 12 2 153--157 . APACrefDOI doi:10.1007/BF02295996 APACrefDOI

work page doi:10.1007/bf02295996 1947
[31]

, Inala, J P

olausson2024selfrepair APACrefauthors Olausson, T X. , Inala, J P. , Wang, C. , Gao, J. \ Solar-Lezama, A. APACrefauthors \ 2024 . Is Self-Repair a Silver Bullet for Code Generation? Is self-repair a silver bullet for code generation? Proceedings of the Twelfth International Conference on Learning Representations (ICLR). Proceedings of the twelfth interna...

2024
[32]

, Bowman, S R

panickssery2024selfrecognition APACrefauthors Panickssery, A. , Bowman, S R. \ Feng, S. APACrefauthors \ 2024 . LLM Evaluators Recognize and Favor Their Own Generations LLM evaluators recognize and favor their own generations . Advances in Neural Information Processing Systems 37 (NeurIPS 2024). Advances in neural information processing systems 37 (neurip...

work page doi:10.52202/079017-2197 2024
[33]

APACrefauthors \ 1959

popper1959logic APACrefauthors Popper, K R. APACrefauthors \ 1959 . The Logic of Scientific Discovery The logic of scientific discovery . London Hutchinson . English translation of Logik der Forschung (1934); degree of corroboration C(h,e) in Appendix *ix

1959
[34]

APACrefauthors \ 1963

popper1963conjectures APACrefauthors Popper, K R. APACrefauthors \ 1963 . Conjectures and Refutations: The Growth of Scientific Knowledge Conjectures and refutations: The growth of scientific knowledge . London Routledge and Kegan Paul . Verisimilitude/truthlikeness in ch. 10 and Addenda

1963
[35]

\ Albassam, E

rizqullah2026tdp APACrefauthors Rizqullah, M. \ Albassam, E. APACrefauthors \ 2026 . Model-Agnostic Empirical Evaluation of Test-Driven Prompt Engineering on Improving Accuracy and Efficiency in Large Language Models Python Code Generation Model-agnostic empirical evaluation of test-driven prompt engineering on improving accuracy and efficiency in large l...

work page doi:10.1109/access.2026.3662817 2026
[36]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

sclar2024formatting APACrefauthors Sclar, M. , Choi, Y. , Tsvetkov, Y. \ Suhr, A. APACrefauthors \ 2024 . Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design, or: How I Learned to Start Worrying about Prompt Formatting Quantifying language models' sensitivity to spurious features in prompt design, or: How I learned to start worr...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Reflexion: Language Agents with Verbal Reinforcement Learning

shinn2023reflexion APACrefauthors Shinn, N. , Cassano, F. , Berman, E. , Gopinath, A. , Narasimhan, K. \ Yao, S. APACrefauthors \ 2023 . Reflexion : Language Agents with Verbal Reinforcement Learning Reflexion : Language agents with verbal reinforcement learning . Advances in Neural Information Processing Systems (NeurIPS). Advances in neural information ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

, Bansal, H

singhi2025solve APACrefauthors Singhi, N. , Bansal, H. , Hosseini, A. , Grover, A. , Chang, K W. , Rohrbach, M. \ Rohrbach, A. APACrefauthors \ 2025 . When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning When to solve, when to verify: Compute-optimal problem solving and generative verification for LL...

work page arXiv 2025
[39]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

snell2024testtime APACrefauthors Snell, C. , Lee, J. , Xu, K. \ Kumar, A. APACrefauthors \ 2025 . Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters Scaling LLM test-time compute optimally can be more effective than scaling model parameters . International Conference on Learning Representations (ICLR). Internationa...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

, Zhang, H

song2025mindthegap APACrefauthors Song, Y. , Zhang, H. , Eisenach, C. , Kakade, S M. , Foster, D. \ Ghai, U. APACrefauthors \ 2025 . Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models Mind the gap: Examining the self-improvement capabilities of large language models . International Conference on Learning Representations (IC...

work page arXiv 2025
[41]

, Valmeekam, K

stechly2025selfverification APACrefauthors Stechly, K. , Valmeekam, K. \ Kambhampati, S. APACrefauthors \ 2025 . On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks On the self-verification limitations of large language models on reasoning and planning tasks . International Conference on Learning Representations (...

work page arXiv 2025
[42]

, Gautam, A

takallou2026llms APACrefauthors Takallou, M A. , Gautam, A. \ Shirolkar, P. APACrefauthors \ 2026 . LLMs for End-to-End Machine Learning: A Comprehensive Review LLMs for end-to-end machine learning: A comprehensive review . IEEE Access 14 47717--47738 . APACrefDOI doi:10.1109/ACCESS.2026.3678429 APACrefDOI

work page doi:10.1109/access.2026.3678429 2026
[43]

tian2024debugbench APACrefauthors Tian, R. , Ye, Y. , Qin, Y. , Cong, X. , Lin, Y. , Pan, Y. , Wu, Y. , Hui, H. , Liu, W. , Liu, Z. \ Sun, M. APACrefauthors \ 2024 . DebugBench : Evaluating Debugging Capability of Large Language Models DebugBench : Evaluating debugging capability of large language models . Findings of the Association for Computational Lin...

work page arXiv 2024
[44]

\ Kahneman, D

tversky1974judgment APACrefauthors Tversky, A. \ Kahneman, D. APACrefauthors \ 1974 . Judgment under Uncertainty: Heuristics and Biases Judgment under uncertainty: Heuristics and biases . Science 185 4157 1124--1131 . APACrefDOI doi:10.1126/science.185.4157.1124 APACrefDOI

work page doi:10.1126/science.185.4157.1124 1974
[45]

, Marquez, M

valmeekam2023self APACrefauthors Valmeekam, K. , Marquez, M. \ Kambhampati, S. APACrefauthors \ 2023 . Can Large Language Models Really Improve by Self-critiquing Their Own Plans? Can Large Language Models Really Improve by Self-critiquing Their Own Plans? APACrefURL https://arxiv.org/abs/2310.08118 APACrefURL Preprint. arXiv:2310.08118, 12 Oct 2023

work page arXiv 2023
[46]

, Cassano, F

wang2025plansearch APACrefauthors Wang, E. , Cassano, F. , Wu, C. , Bai, Y. , Song, W. , Nath, V. , Han, Z. , Hendryx, S. , Yue, S. \ Zhang, H. APACrefauthors \ 2025 . Planning in Natural Language Improves LLM Search for Code Generation Planning in natural language improves LLM search for code generation . International Conference on Learning Representati...

work page arXiv 2025
[47]

, Huang, Y

wang2024software APACrefauthors Wang, J. , Huang, Y. , Chen, C. , Liu, Z. , Wang, S. \ Wang, Q. APACrefauthors \ 2024 . Software Testing With Large Language Models: Survey, Landscape, and Vision Software testing with large language models: Survey, landscape, and vision . IEEE Transactions on Software Engineering 50 4 911--936 . APACrefDOI doi:10.1109/TSE....

work page doi:10.1109/tse.2024.3368208 2024
[48]

, Zhou, Y

yang2024cotton APACrefauthors Yang, G. , Zhou, Y. , Chen, X. , Zhang, X. , Zhuo, T Y. \ Chen, T. APACrefauthors \ 2024 . Chain-of-Thought in Neural Code Generation: From and for Lightweight Language Models Chain-of-thought in neural code generation: From and for lightweight language models . IEEE Transactions on Software Engineering 50 9 2437--2457 . APAC...

work page doi:10.1109/tse.2024.3440503 2024
[49]

, Wang, D

zhang2023algo APACrefauthors Zhang, K. , Wang, D. , Xia, J. , Wang, W Y. \ Li, L. APACrefauthors \ 2023 . ALGO : Synthesizing Algorithmic Programs with LLM -Generated Oracle Verifiers ALGO : Synthesizing algorithmic programs with LLM -generated oracle verifiers . Advances in Neural Information Processing Systems 36 (NeurIPS). Advances in neural informatio...

work page arXiv 2023
[50]

, Decugis, J

zheng2025multiturn APACrefauthors Zheng, K. , Decugis, J. , Gehring, J. , Cohen, T. , Negrevergne, B. \ Synnaeve, G. APACrefauthors \ 2025 . What Makes Large Language Models Reason in (Multi-Turn) Code Generation? What makes large language models reason in (multi-turn) code generation? International Conference on Learning Representations (ICLR). Internati...

work page arXiv 2025
[51]

, Zhang, G

zheng2024opencodeinterpreter APACrefauthors Zheng, T. , Zhang, G. , Shen, T. , Liu, X. , Lin, B Y. , Fu, J. , Chen, W. \ Yue, X. APACrefauthors \ 2024 . OpenCodeInterpreter : Integrating Code Generation with Execution and Refinement OpenCodeInterpreter : Integrating code generation with execution and refinement . Findings of the Association for Computatio...

work page doi:10.18653/v1/2024.findings-acl.762 2024
[52]

, Ning, K

zheng2025understanding APACrefauthors Zheng, Z. , Ning, K. , Zhong, Q. , Chen, J. , Chen, W. , Guo, L. , Wang, W. \ Wang, Y. APACrefauthors \ 2025 . Towards an Understanding of Large Language Models in Software Engineering Tasks Towards an understanding of large language models in software engineering tasks . Empirical Software Engineering 30 2 50 . APACr...

work page doi:10.1007/s10664-024-10602-0 2025
[53]

, Wang, Z

zhong2024ldb APACrefauthors Zhong, L. , Wang, Z. \ Shang, J. APACrefauthors \ 2024 . Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step Debug like a human: A large language model debugger via verifying runtime execution step by step . Findings of the Association for Computational Linguistics: ACL 2024 Findings...

work page doi:10.18653/v1/2024.findings-acl.49 2024
[54]

, Al-Hitmi, M

zubair2025repair APACrefauthors Zubair, F. , Al-Hitmi, M. \ Catal, C. APACrefauthors \ 2025 . The Use of Large Language Models for Program Repair The use of large language models for program repair . Computer Standards & Interfaces 93 103951 . APACrefDOI doi:10.1016/j.csi.2024.103951 APACrefDOI

work page doi:10.1016/j.csi.2024.103951 2025

[1] [1]

Program Synthesis with Large Language Models

austin2021mbpp APACrefauthors Austin, J. , Odena, A. , Nye, M. , Bosma, M. , Michalewski, H. , Dohan, D. , Jiang, E. , Cai, C. , Terry, M. , Le, Q. \ Sutton, C. APACrefauthors \ 2021 . Program Synthesis with Large Language Models. Program synthesis with large language models. APACrefURL https://arxiv.org/abs/2108.07732 APACrefURL

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

brown2024monkeys APACrefauthors Brown, B. , Juravsky, J. , Ehrlich, R. , Clark, R. , Le, Q V. , R \'e , C. \ Mirhoseini, A. APACrefauthors \ 2024 . Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Large language monkeys: Scaling inference compute with repeated sampling. APACrefURL https://arxiv.org/abs/2407.21787 APACrefURL

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

\ Toubia, O

brucks2025prompt APACrefauthors Brucks, M. \ Toubia, O. APACrefauthors \ 2025 . Prompt Architecture Induces Methodological Artifacts in Large Language Models Prompt architecture induces methodological artifacts in large language models . PLOS ONE 20 4 e0319159 . APACrefDOI doi:10.1371/journal.pone.0319159 APACrefDOI

work page doi:10.1371/journal.pone.0319159 2025

[4] [4]

, Tang, H

chen2024funcoder APACrefauthors Chen, J. , Tang, H. , Chu, Z. , Chen, Q. , Wang, Z. , Liu, M. \ Qin, B. APACrefauthors \ 2024 . Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation Divide-and-conquer meets consensus: Unleashing the power of functions in code generation . Advances in Neural Information Processing Systems...

work page arXiv 2024

[5] [5]

Evaluating Large Language Models Trained on Code

chen2021humaneval APACrefauthors Chen, M. , Tworek, J. , Jun, H. , Yuan, Q. , de Oliveira Pinto, H P. , Kaplan, J. , Edwards, H. , Burda, Y. , Joseph, N. , Brockman, G. , Ray, A. , Puri, R. , Krueger, G. , Petrov, M. , Khlaaf, H. , Sastry, G. , Mishkin, P. , Chan, B. , Gray, S. , Ryder, N. , Pavlov, M. , Power, A. , Kaiser, L. , Bavarian, M. , Winter, C. ...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Teaching Large Language Models to Self-Debug

chen2024selfdebug APACrefauthors Chen, X. , Lin, M. , Sch\" a rli, N. \ Zhou, D. APACrefauthors \ 2024 . Teaching Large Language Models to Self-Debug Teaching large language models to self-debug . The Twelfth International Conference on Learning Representations. The twelfth international conference on learning representations. APACrefURL https://arxiv.org...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

, Kang, D

cho2025cocos APACrefauthors Cho, J. , Kang, D. , Kim, H. \ Lee, G G. APACrefauthors \ 2025 . Self-Correcting Code Generation Using Small Language Models Self-correcting code generation using small language models . Findings of the Association for Computational Linguistics: EMNLP 2025 Findings of the association for computational linguistics: Emnlp 2025 \ ...

work page doi:10.18653/v1/2025.findings-emnlp.127 2025

[8] [8]

, Tennenholtz, G

chow2025inferenceaware APACrefauthors Chow, Y. , Tennenholtz, G. , Gur, I. , Zhuang, V. , Dai, B. , Thiagarajan, S. , Boutilier, C. , Agarwal, R. , Kumar, A. \ Faust, A. APACrefauthors \ 2025 . Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models Inference-aware fine-tuning for best-of-n sampling in large language models . Internati...

work page arXiv 2025

[9] [9]

, Shenfeld, I

damani2024hard APACrefauthors Damani, M. , Shenfeld, I. , Peng, A. , Bobu, A. \ Andreas, J. APACrefauthors \ 2025 . Learning How Hard to Think: Input-Adaptive Allocation of LM Computation Learning how hard to think: Input-adaptive allocation of LM computation . International Conference on Learning Representations (ICLR). International conference on learni...

work page arXiv 2025

[10] [10]

, Min, M J

ding2024cycle APACrefauthors Ding, Y. , Min, M J. , Kaiser, G. \ Ray, B. APACrefauthors \ 2024 . CYCLE : Learning to Self-Refine the Code Generation CYCLE : Learning to self-refine the code generation . Proceedings of the ACM on Programming Languages (OOPSLA). Proceedings of the acm on programming languages (oopsla). APACrefURL https://arxiv.org/abs/2403....

work page arXiv 2024

[11] [11]

, Zhao, J

dinh2023bugs APACrefauthors Dinh, T. , Zhao, J. , Tan, S. , Negrinho, R. , Lausen, L. , Zha, S. \ Karypis, G. APACrefauthors \ 2023 . Large Language Models of Code Fail at Completing Code with Potential Bugs Large language models of code fail at completing code with potential bugs . Advances in Neural Information Processing Systems 36 (NeurIPS). Advances ...

work page arXiv 2023

[12] [12]

, Lydersen, S

fagerland2013mcnemar APACrefauthors Fagerland, M W. , Lydersen, S. \ Laake, P. APACrefauthors \ 2013 . The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional . BMC Medical Research Methodology 13 1 91 . APA...

work page doi:10.1186/1471-2288-13-91 2013

[13] [13]

\ Bedrick, S

gorman2019splits APACrefauthors Gorman, K. \ Bedrick, S. APACrefauthors \ 2019 . We Need to Talk about Standard Splits We need to talk about standard splits . Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) Proceedings of the 57th annual meeting of the association for computational linguistics (acl) \ ( \ 2786...

work page doi:10.18653/v1/p19-1267 2019

[14] [14]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

guo2024deepseekcoder APACrefauthors Guo, D. , Zhu, Q. , Yang, D. , Xie, Z. , Dong, K. , Zhang, W. , Chen, G. , Bi, X. , Wu, Y. , Li, Y K. , Luo, F. , Xiong, Y. \ Liang, W. APACrefauthors \ 2024 . DeepSeek-Coder : When the Large Language Model Meets Programming -- The Rise of Code Intelligence. DeepSeek-Coder : When the large language model meets programmi...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

, Chatzimparmpas, A

hofman2023prereg APACrefauthors Hofman, J M. , Chatzimparmpas, A. , Sharma, A. , Watts, D J. \ Hullman, J. APACrefauthors \ 2023 . Pre-registration for Predictive Modeling. Pre-registration for predictive modeling. APACrefURL https://arxiv.org/abs/2311.18807 APACrefURL

work page arXiv 2023

[16] [16]

APACrefauthors \ 1979

holm1979simple APACrefauthors Holm, S. APACrefauthors \ 1979 . A Simple Sequentially Rejective Multiple Test Procedure A simple sequentially rejective multiple test procedure . Scandinavian Journal of Statistics 6 2 65--70

1979

[17] [17]

, Chen, X

huang2024cannot APACrefauthors Huang, J. , Chen, X. , Mishra, S. , Zheng, H S. , Yu, A W. , Song, X. \ Zhou, D. APACrefauthors \ 2024 . Large Language Models Cannot Self-Correct Reasoning Yet Large language models cannot self-correct reasoning yet . The Twelfth International Conference on Learning Representations. The twelfth international conference on l...

2024

[18] [18]

Qwen2.5-Coder Technical Report

hui2024qwen25coder APACrefauthors Hui, B. , Yang, J. , Cui, Z. , Yang, J. , Liu, D. , Zhang, L. , Liu, T. , Zhang, J. , Yu, B. , Lu, K. , Dang, K. , Fan, Y. , Zhang, Y. , Yang, A. , Men, R. , Huang, F. , Zheng, B. , Miao, Y. , Quan, S. , Feng, Y. , Ren, X. , Ren, X. , Zhou, J. \ Lin, J. APACrefauthors \ 2024 . Qwen2.5-Coder Technical Report. Qwen2.5-Coder...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

iscan2026scaffold APACrefauthors \. I s can, M. APACrefauthors \ 2026 1 . Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill. Scaffold, not vocabulary? a controlled, two-tier, pre-registered study of a popperian code-generation skill. APACrefURL https://arxiv.org/abs/2606.06454 APACrefURL

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

I s can, M

iscan2026selection APACrefauthors \. I s can, M. APACrefauthors \ 2026 2 . Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models. Selection without signal, recovery through expression: A measurement study of post-hoc falsification operators for frozen small code models. ...

work page arXiv 2026

[21] [21]

jiang2024ledex APACrefauthors Jiang, N. , Li, X. , Wang, S. , Zhou, Q. , Hossain, S B. , Ray, B. , Kumar, V. , Ma, X. \ Deoras, A. APACrefauthors \ 2024 . LeDex : Training LLMs to Better Self-Debug and Explain Code LeDex : Training LLMs to better self-debug and explain code . Advances in Neural Information Processing Systems 37 (NeurIPS). Advances in neur...

work page arXiv 2024

[22] [22]

\ Chen, H

jin2025selfcritiquefail APACrefauthors Jin, H. \ Chen, H. APACrefauthors \ 2025 . Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications Uncovering systematic failures of LLMs in verifying code against natural language specifications . 2025 40th IEEE/ACM International Conference on Automated Software Engineering (...

work page doi:10.1109/ase63991.2025.00323 2025

[23] [23]

\ Steinhardt, J

jones2022cognitivebiases APACrefauthors Jones, E. \ Steinhardt, J. APACrefauthors \ 2022 . Capturing Failures of Large Language Models via Human Cognitive Biases Capturing failures of large language models via human cognitive biases . Advances in Neural Information Processing Systems 35 (NeurIPS). Advances in neural information processing systems 35 (neur...

work page arXiv 2022

[24] [24]

, de Oliveira Neto, F G

khojah2025promptprogramming APACrefauthors Khojah, R. , de Oliveira Neto, F G. , Mohamad, M. \ Leitner, P. APACrefauthors \ 2025 . The Impact of Prompt Programming on Function-Level Code Generation The impact of prompt programming on function-level code generation . IEEE Transactions on Software Engineering 51 8 2381--2401 . APACrefDOI doi:10.1109/TSE.202...

work page doi:10.1109/tse.2025.3587794 2025

[25] [25]

, Wang, C

lam2025codecrash APACrefauthors Lam, M H. , Wang, C. , Huang, J t. \ Lyu, M R. APACrefauthors \ 2025 . CodeCrash : Stress Testing LLM Reasoning under Structural and Semantic Perturbations. CodeCrash : Stress testing LLM reasoning under structural and semantic perturbations. APACrefURL https://arxiv.org/abs/2504.14119 APACrefURL

work page arXiv 2025

[26] [26]

, Katz, S

lavon2025egcfg APACrefauthors Lavon, B. , Katz, S. \ Wolf, L. APACrefauthors \ 2025 . Execution Guided Line-by-Line Code Generation. Execution guided line-by-line code generation. APACrefURL https://arxiv.org/abs/2506.10948 APACrefURL

work page arXiv 2025

[27] [27]

, Xia, C S

liu2023evalplus APACrefauthors Liu, J. , Xia, C S. , Wang, Y. \ Zhang, L. APACrefauthors \ 2023 . Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation . Advances in Neural Informat...

2023

[28] [28]

Self-Refine: Iterative Refinement with Self-Feedback

madaan2023selfrefine APACrefauthors Madaan, A. , Tandon, N. , Gupta, P. , Hallinan, S. , Gao, L. , Wiegreffe, S. , Alon, U. , Dziri, N. , Prabhumoye, S. , Yang, Y. , Gupta, S. , Majumder, B P. , Hermann, K. , Welleck, S. , Yazdanbakhsh, A. \ Clark, P. APACrefauthors \ 2023 . Self-Refine : Iterative Refinement with Self-Feedback Self-Refine : Iterative ref...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

\ Spanos, A

mayospanos2006severe APACrefauthors Mayo, D G. \ Spanos, A. APACrefauthors \ 2006 . Severe Testing as a Basic Concept in a Neyman--Pearson Philosophy of Induction Severe testing as a basic concept in a Neyman--Pearson philosophy of induction . The British Journal for the Philosophy of Science 57 2 323--357 . Advance Access published April 11, 2006 APACref...

work page doi:10.1093/bjps/axl003 2006

[30] [30]

APACrefauthors \ 1947

mcnemar1947note APACrefauthors McNemar, Q. APACrefauthors \ 1947 . Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages Note on the sampling error of the difference between correlated proportions or percentages . Psychometrika 12 2 153--157 . APACrefDOI doi:10.1007/BF02295996 APACrefDOI

work page doi:10.1007/bf02295996 1947

[31] [31]

, Inala, J P

olausson2024selfrepair APACrefauthors Olausson, T X. , Inala, J P. , Wang, C. , Gao, J. \ Solar-Lezama, A. APACrefauthors \ 2024 . Is Self-Repair a Silver Bullet for Code Generation? Is self-repair a silver bullet for code generation? Proceedings of the Twelfth International Conference on Learning Representations (ICLR). Proceedings of the twelfth interna...

2024

[32] [32]

, Bowman, S R

panickssery2024selfrecognition APACrefauthors Panickssery, A. , Bowman, S R. \ Feng, S. APACrefauthors \ 2024 . LLM Evaluators Recognize and Favor Their Own Generations LLM evaluators recognize and favor their own generations . Advances in Neural Information Processing Systems 37 (NeurIPS 2024). Advances in neural information processing systems 37 (neurip...

work page doi:10.52202/079017-2197 2024

[33] [33]

APACrefauthors \ 1959

popper1959logic APACrefauthors Popper, K R. APACrefauthors \ 1959 . The Logic of Scientific Discovery The logic of scientific discovery . London Hutchinson . English translation of Logik der Forschung (1934); degree of corroboration C(h,e) in Appendix *ix

1959

[34] [34]

APACrefauthors \ 1963

popper1963conjectures APACrefauthors Popper, K R. APACrefauthors \ 1963 . Conjectures and Refutations: The Growth of Scientific Knowledge Conjectures and refutations: The growth of scientific knowledge . London Routledge and Kegan Paul . Verisimilitude/truthlikeness in ch. 10 and Addenda

1963

[35] [35]

\ Albassam, E

rizqullah2026tdp APACrefauthors Rizqullah, M. \ Albassam, E. APACrefauthors \ 2026 . Model-Agnostic Empirical Evaluation of Test-Driven Prompt Engineering on Improving Accuracy and Efficiency in Large Language Models Python Code Generation Model-agnostic empirical evaluation of test-driven prompt engineering on improving accuracy and efficiency in large l...

work page doi:10.1109/access.2026.3662817 2026

[36] [36]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

sclar2024formatting APACrefauthors Sclar, M. , Choi, Y. , Tsvetkov, Y. \ Suhr, A. APACrefauthors \ 2024 . Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design, or: How I Learned to Start Worrying about Prompt Formatting Quantifying language models' sensitivity to spurious features in prompt design, or: How I learned to start worr...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Reflexion: Language Agents with Verbal Reinforcement Learning

shinn2023reflexion APACrefauthors Shinn, N. , Cassano, F. , Berman, E. , Gopinath, A. , Narasimhan, K. \ Yao, S. APACrefauthors \ 2023 . Reflexion : Language Agents with Verbal Reinforcement Learning Reflexion : Language agents with verbal reinforcement learning . Advances in Neural Information Processing Systems (NeurIPS). Advances in neural information ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

, Bansal, H

singhi2025solve APACrefauthors Singhi, N. , Bansal, H. , Hosseini, A. , Grover, A. , Chang, K W. , Rohrbach, M. \ Rohrbach, A. APACrefauthors \ 2025 . When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning When to solve, when to verify: Compute-optimal problem solving and generative verification for LL...

work page arXiv 2025

[39] [39]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

snell2024testtime APACrefauthors Snell, C. , Lee, J. , Xu, K. \ Kumar, A. APACrefauthors \ 2025 . Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters Scaling LLM test-time compute optimally can be more effective than scaling model parameters . International Conference on Learning Representations (ICLR). Internationa...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

, Zhang, H

song2025mindthegap APACrefauthors Song, Y. , Zhang, H. , Eisenach, C. , Kakade, S M. , Foster, D. \ Ghai, U. APACrefauthors \ 2025 . Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models Mind the gap: Examining the self-improvement capabilities of large language models . International Conference on Learning Representations (IC...

work page arXiv 2025

[41] [41]

, Valmeekam, K

stechly2025selfverification APACrefauthors Stechly, K. , Valmeekam, K. \ Kambhampati, S. APACrefauthors \ 2025 . On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks On the self-verification limitations of large language models on reasoning and planning tasks . International Conference on Learning Representations (...

work page arXiv 2025

[42] [42]

, Gautam, A

takallou2026llms APACrefauthors Takallou, M A. , Gautam, A. \ Shirolkar, P. APACrefauthors \ 2026 . LLMs for End-to-End Machine Learning: A Comprehensive Review LLMs for end-to-end machine learning: A comprehensive review . IEEE Access 14 47717--47738 . APACrefDOI doi:10.1109/ACCESS.2026.3678429 APACrefDOI

work page doi:10.1109/access.2026.3678429 2026

[43] [43]

tian2024debugbench APACrefauthors Tian, R. , Ye, Y. , Qin, Y. , Cong, X. , Lin, Y. , Pan, Y. , Wu, Y. , Hui, H. , Liu, W. , Liu, Z. \ Sun, M. APACrefauthors \ 2024 . DebugBench : Evaluating Debugging Capability of Large Language Models DebugBench : Evaluating debugging capability of large language models . Findings of the Association for Computational Lin...

work page arXiv 2024

[44] [44]

\ Kahneman, D

tversky1974judgment APACrefauthors Tversky, A. \ Kahneman, D. APACrefauthors \ 1974 . Judgment under Uncertainty: Heuristics and Biases Judgment under uncertainty: Heuristics and biases . Science 185 4157 1124--1131 . APACrefDOI doi:10.1126/science.185.4157.1124 APACrefDOI

work page doi:10.1126/science.185.4157.1124 1974

[45] [45]

, Marquez, M

valmeekam2023self APACrefauthors Valmeekam, K. , Marquez, M. \ Kambhampati, S. APACrefauthors \ 2023 . Can Large Language Models Really Improve by Self-critiquing Their Own Plans? Can Large Language Models Really Improve by Self-critiquing Their Own Plans? APACrefURL https://arxiv.org/abs/2310.08118 APACrefURL Preprint. arXiv:2310.08118, 12 Oct 2023

work page arXiv 2023

[46] [46]

, Cassano, F

wang2025plansearch APACrefauthors Wang, E. , Cassano, F. , Wu, C. , Bai, Y. , Song, W. , Nath, V. , Han, Z. , Hendryx, S. , Yue, S. \ Zhang, H. APACrefauthors \ 2025 . Planning in Natural Language Improves LLM Search for Code Generation Planning in natural language improves LLM search for code generation . International Conference on Learning Representati...

work page arXiv 2025

[47] [47]

, Huang, Y

wang2024software APACrefauthors Wang, J. , Huang, Y. , Chen, C. , Liu, Z. , Wang, S. \ Wang, Q. APACrefauthors \ 2024 . Software Testing With Large Language Models: Survey, Landscape, and Vision Software testing with large language models: Survey, landscape, and vision . IEEE Transactions on Software Engineering 50 4 911--936 . APACrefDOI doi:10.1109/TSE....

work page doi:10.1109/tse.2024.3368208 2024

[48] [48]

, Zhou, Y

yang2024cotton APACrefauthors Yang, G. , Zhou, Y. , Chen, X. , Zhang, X. , Zhuo, T Y. \ Chen, T. APACrefauthors \ 2024 . Chain-of-Thought in Neural Code Generation: From and for Lightweight Language Models Chain-of-thought in neural code generation: From and for lightweight language models . IEEE Transactions on Software Engineering 50 9 2437--2457 . APAC...

work page doi:10.1109/tse.2024.3440503 2024

[49] [49]

, Wang, D

zhang2023algo APACrefauthors Zhang, K. , Wang, D. , Xia, J. , Wang, W Y. \ Li, L. APACrefauthors \ 2023 . ALGO : Synthesizing Algorithmic Programs with LLM -Generated Oracle Verifiers ALGO : Synthesizing algorithmic programs with LLM -generated oracle verifiers . Advances in Neural Information Processing Systems 36 (NeurIPS). Advances in neural informatio...

work page arXiv 2023

[50] [50]

, Decugis, J

zheng2025multiturn APACrefauthors Zheng, K. , Decugis, J. , Gehring, J. , Cohen, T. , Negrevergne, B. \ Synnaeve, G. APACrefauthors \ 2025 . What Makes Large Language Models Reason in (Multi-Turn) Code Generation? What makes large language models reason in (multi-turn) code generation? International Conference on Learning Representations (ICLR). Internati...

work page arXiv 2025

[51] [51]

, Zhang, G

zheng2024opencodeinterpreter APACrefauthors Zheng, T. , Zhang, G. , Shen, T. , Liu, X. , Lin, B Y. , Fu, J. , Chen, W. \ Yue, X. APACrefauthors \ 2024 . OpenCodeInterpreter : Integrating Code Generation with Execution and Refinement OpenCodeInterpreter : Integrating code generation with execution and refinement . Findings of the Association for Computatio...

work page doi:10.18653/v1/2024.findings-acl.762 2024

[52] [52]

, Ning, K

zheng2025understanding APACrefauthors Zheng, Z. , Ning, K. , Zhong, Q. , Chen, J. , Chen, W. , Guo, L. , Wang, W. \ Wang, Y. APACrefauthors \ 2025 . Towards an Understanding of Large Language Models in Software Engineering Tasks Towards an understanding of large language models in software engineering tasks . Empirical Software Engineering 30 2 50 . APACr...

work page doi:10.1007/s10664-024-10602-0 2025

[53] [53]

, Wang, Z

zhong2024ldb APACrefauthors Zhong, L. , Wang, Z. \ Shang, J. APACrefauthors \ 2024 . Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step Debug like a human: A large language model debugger via verifying runtime execution step by step . Findings of the Association for Computational Linguistics: ACL 2024 Findings...

work page doi:10.18653/v1/2024.findings-acl.49 2024

[54] [54]

, Al-Hitmi, M

zubair2025repair APACrefauthors Zubair, F. , Al-Hitmi, M. \ Catal, C. APACrefauthors \ 2025 . The Use of Large Language Models for Program Repair The use of large language models for program repair . Computer Standards & Interfaces 93 103951 . APACrefDOI doi:10.1016/j.csi.2024.103951 APACrefDOI

work page doi:10.1016/j.csi.2024.103951 2025