pith. machine review for the scientific record.

arxiv: 2604.12214 · v1 · submitted 2026-04-14 · 💻 cs.SE

Recognition: unknown

Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3

classification 💻 cs.SE
keywords Chain-of-Thought · LLM4Code · robustness · structural anchors · reasoning trajectories · perturbations · uncertainty · code generation

The pith

CoT for code generation stays robust only when perturbations spare three key structural commitment points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how Chain-of-Thought prompting affects the stability of large language models when generating code from perturbed inputs. It proposes that CoT helps only when changes do not disrupt specific points where the model commits to the reasoning structures that lead to code. In large-scale tests across six models and two benchmarks, the authors apply character-, word-, and sentence-level perturbations to task descriptions and track token-level uncertainty throughout generation. They find that CoT and non-CoT approaches show different vulnerabilities, with failures often linked to three types of changes in the reasoning path. Early detection of rising uncertainty can pinpoint where these disruptions start.
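As a concrete illustration, the three perturbation families can be sketched with toy operators. The specific edits below (adjacent-character swap, word deletion, dropping a sentence) are assumptions for illustration, not the operators the study actually used.

```python
import random

def perturb_docstring(docstring: str, level: str, seed: int = 0) -> str:
    """Apply one perturbation at the given granularity.

    Illustrative operators only; the paper's concrete edit
    operators may differ.
    """
    rng = random.Random(seed)
    if level == "char":
        # character-level: swap two adjacent characters
        chars = list(docstring)
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)
    if level == "word":
        # word-level: delete one randomly chosen word
        words = docstring.split()
        words.pop(rng.randrange(len(words)))
        return " ".join(words)
    if level == "sentence":
        # sentence-level: drop the final sentence
        sentences = docstring.split(". ")
        return ". ".join(sentences[:-1]) + "."
    raise ValueError(f"unknown perturbation level: {level}")
```

Each operator preserves most of the task description while changing exactly one unit at its granularity, which is what lets the study attribute downstream instability to a single controlled edit.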

Core claim

CoT prompting does not provide uniform gains in performance or robustness for code generation tasks. Instead, its effectiveness depends on whether input perturbations destabilize structurally sensitive commitment points in the trajectory from reasoning to code output. The study identifies three anchors at which this occurs and shows that their disruption produces characteristic deformations in the generation trace, which can be detected through token-level uncertainty measurements.
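The token-level uncertainty measurements the claim leans on are standard quantities over the model's next-token distribution. A minimal sketch of the two signals named in Figure 1 (entropy and probability differential), assuming per-step probabilities are available:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def probability_differential(probs: list[float]) -> float:
    """Gap between the top-1 and top-2 token probabilities; a small
    gap means the model is torn between competing continuations."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]
```

High entropy and a small top-1/top-2 gap both indicate a token position where the generation could plausibly have gone another way, which is the raw signal behind the spike analysis.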

What carries the argument

Three structural anchors—reasoning-code transition, symbolic commitment, and algorithmic articulation—marking commitment points in the reasoning trajectory where perturbations can trigger specific deformations and failures.

If this is right

  • CoT benefits are not uniform but contingent on model family, task structure, and prompt explicitness.
  • CoT and No-CoT prompting exhibit distinct robustness profiles, with different perturbation families triggering different failure modes.
  • Three recurrent trajectory deformations—lengthening, branching, and simplification—systematically emerge when perturbations interact with the structural anchors.
  • Early-stage uncertainty serves as a reliable diagnostic signal for localizing where trajectory instability begins around sensitive anchors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Generation systems could monitor uncertainty around these anchors in real time to decide whether to use CoT or fall back to direct prompting.
  • The framework of structural anchors and trajectory deformations could be tested on non-code reasoning tasks such as mathematical proofs or logical deduction.
  • Prompt design might deliberately reinforce the three anchors to reduce sensitivity to realistic input noise.
  • New robustness benchmarks could be built by systematically targeting these commitment points rather than applying random changes.
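The first of these extensions could take the form of a simple runtime policy. The threshold and window values below are hypothetical, not drawn from the paper:

```python
def choose_prompting(head_entropies: list[float],
                     spike_threshold: float = 2.5,
                     window: int = 16) -> str:
    """Hypothetical policy: fall back to direct (No-CoT) prompting
    when uncertainty spikes within the first `window` CoT tokens;
    otherwise keep the CoT trajectory."""
    if any(h >= spike_threshold for h in head_entropies[:window]):
        return "direct"
    return "cot"
```

In practice such a policy would need the early tokens of a CoT attempt before deciding, so the fallback trades a small amount of wasted generation for avoiding a destabilized trajectory.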

Load-bearing premise

The three structural anchors capture the primary points of fragility, and the chosen character-, word-, and sentence-level perturbations adequately represent realistic input variations in code generation tasks.

What would settle it

The central claim would be disproved by a new set of models or benchmarks in which perturbing the identified anchors does not increase failure rates, or in which early uncertainty spikes fail to precede trajectory deformations.

Figures

Figures reproduced from arXiv: 2604.12214 by Armstrong Foundjem, Da Song, Foutse Khomh, Heng Li, Yang Liu.

Figure 1: Methodology overview and research questions (RQs) mapping. Steps are labeled A–F. We evaluate CoT and No-CoT prompting on MHPP and BigCodeBench under controlled character (C), word (W), and sentence (S) perturbations. Execution-based metrics (Pass@k, RD) address RQ1 (performance impact of CoT) and RQ2 (robustness under perturbations). Token-level uncertainty signals (entropy, probability differential) supp…

Figure 2: Distribution of the first uncertainty spike position (token scale). The x-axis denotes the normalized position of the first major uncertainty spike S/T along the generated trajectory, aggregated across all models, datasets, temperatures, and perturbation conditions. Under CoT, spikes are broadly distributed across the trajectory with a long tail toward later tokens, whereas under No-CoT they are heavily fr…

Figure 3: Alignment between the first uncertainty spike and structural anchors. Each panel shows the distribution of normalized distances ∆ = (S − A_k)/T between the first uncertainty spike S and a structural anchor A_k. Vertical dashed lines indicate the anchor location (∆ = 0). Negative values correspond to spikes occurring before the anchor, and positive values to spikes occurring after. Under CoT, uncertainty spikes…

Figure 4: Association between structural anchors and deformation patterns under CoT. Each cell reports the mean magnitude of the first uncertainty spike associated with a given deformation pattern at a specific anchor. Higher values indicate stronger uncertainty concentration. Lengthening is most strongly associated with early instability near the reasoning–code transition (A1), while branching shows elevated uncert…
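The alignment statistic in Figure 3 follows directly from the spike and anchor token indices. A sketch of the caption's convention:

```python
def anchor_alignment(spike: int, anchor: int, trace_len: int) -> float:
    """Normalized distance ∆ = (S − A_k) / T between the first
    uncertainty spike S and a structural anchor A_k over a trace of
    T tokens. Negative: the spike precedes the anchor; positive:
    it follows the anchor."""
    return (spike - anchor) / trace_len
```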
Original abstract

Chain-of-Thought (CoT) prompting is widely used to elicit explicit reasoning from large language models for code (LLM4Code). However, its impact on robustness and the stability of reasoning trajectories under realistic input perturbations remains poorly understood. Prior work has largely evaluated CoT through final correctness, leaving a critical gap in understanding how CoT reshapes internal uncertainty dynamics and why it sometimes harms rather than helps code generation. We suggest that CoT is not uniformly beneficial; instead, its robustness depends on whether perturbations destabilize structurally sensitive commitment points along the reasoning-to-code trajectory. We conduct a controlled, large-scale empirical study of CoT across six models and two code benchmarks (MHPP and BigCodeBench), subjecting task docstrings to systematic character-, word-, and sentence-level perturbations. We instrument full generation traces with token-level uncertainty and define three novel structural anchors: reasoning-code transition, symbolic commitment, and algorithmic articulation. Findings: (1) CoT does not yield uniform performance or robustness gains: its benefits are contingent on model family, task structure, and prompt explicitness. (2) CoT and No-CoT exhibit distinct robustness profiles, with different perturbation families triggering different failure modes. (3) We identify three recurrent trajectory deformations--Lengthening, Branching, and Simplification--that systematically emerge when perturbations interact with structural anchors and explain failure patterns. (4) Early-stage uncertainty serves as a reliable diagnostic signal for localizing where trajectory instability begins around sensitive anchors. These results provide a unified explanation for CoT's mixed performance and suggest design principles for building more robust reasoning-based code generators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper conducts a large-scale empirical study of Chain-of-Thought (CoT) prompting for code generation across six LLMs and two benchmarks (MHPP, BigCodeBench). It subjects docstrings to character-, word-, and sentence-level perturbations, instruments token-level uncertainty in full traces, and defines three structural anchors (reasoning-code transition, symbolic commitment, algorithmic articulation). The central claim is that CoT robustness is not uniform but depends on whether perturbations destabilize these anchors, producing recurrent trajectory deformations (Lengthening, Branching, Simplification) that explain mixed performance; early uncertainty is proposed as a diagnostic for instability.

Significance. If the empirical links hold, the work supplies a mechanistic account of why CoT sometimes degrades rather than improves code generation, shifting evaluation from final accuracy to internal trajectory stability. The controlled perturbation design and uncertainty instrumentation are strengths that enable concrete, falsifiable observations on fragility points and could inform more robust prompting strategies in LLM4Code.

major comments (3)
  1. [§3.2] §3.2 (definition of structural anchors): the three anchors are identified from instrumented traces and then linked to observed deformations, but the manuscript provides no pre-specified, independent validation that these points are primary rather than post-hoc selections that co-occur with failures. Without an a priori test (e.g., predictive power on held-out data or comparison to other candidate points), the claim that robustness is governed by destabilization at exactly these anchors remains under-supported.
  2. [§4] §4 (experimental results on robustness profiles): the study is described as large-scale yet reports no sample sizes per condition, no statistical tests, confidence intervals, or controls for multiple comparisons when asserting that CoT and No-CoT exhibit distinct profiles and that specific perturbation families trigger different failure modes. This absence makes it impossible to determine whether reported differences exceed noise or confounding factors such as model scale.
  3. [§3.1] §3.1 (perturbation design) and §5 (explanation of failure patterns): character/word/sentence edits to docstrings are used to induce deformations, but no evidence is given that these artificial edits explain more variance than simpler baselines (e.g., prompt length or lexical diversity). The manuscript also lacks any comparison to naturalistic variations (altered constraints, missing edge cases) that occur in real code tasks, weakening the assertion that the chosen perturbations adequately represent realistic input fragility.
minor comments (3)
  1. [Abstract] The abstract and §2 use the term 'structural anchors' without an early, compact definition or diagram; readers must reach §3.2 to understand the three specific points.
  2. [§4] Table or figure captions for the deformation examples (Lengthening, Branching, Simplification) should explicitly state the perturbation type and model that produced each illustrated trace.
  3. [§2] The related-work section should cite recent studies on uncertainty estimation in LLM reasoning (e.g., token-level entropy methods) to situate the instrumentation approach.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical rigor and generalizability of our claims. We address each major comment point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (definition of structural anchors): the three anchors are identified from instrumented traces and then linked to observed deformations, but the manuscript provides no pre-specified, independent validation that these points are primary rather than post-hoc selections that co-occur with failures. Without an a priori test (e.g., predictive power on held-out data or comparison to other candidate points), the claim that robustness is governed by destabilization at exactly these anchors remains under-supported.

    Authors: We agree that the structural anchors were derived from systematic observation of the instrumented traces rather than fully pre-specified in advance. In the revision, we will update §3.2 to ground the three anchors in a priori hypotheses drawn from prior literature on LLM reasoning steps and code generation trajectories. We will also add a validation analysis on held-out data demonstrating their predictive power for failure modes, including explicit comparisons against alternative candidate points (e.g., mid-reasoning transitions or output-only steps) to establish their relative primacy. revision: yes

  2. Referee: [§4] §4 (experimental results on robustness profiles): the study is described as large-scale yet reports no sample sizes per condition, no statistical tests, confidence intervals, or controls for multiple comparisons when asserting that CoT and No-CoT exhibit distinct profiles and that specific perturbation families trigger different failure modes. This absence makes it impossible to determine whether reported differences exceed noise or confounding factors such as model scale.

    Authors: We acknowledge this gap in statistical reporting. The revised §4 will explicitly report sample sizes per condition (tasks per model, benchmark, and perturbation type). We will incorporate appropriate statistical tests (paired t-tests and ANOVA for performance differences; chi-squared tests for failure mode distributions), 95% confidence intervals, and multiple-comparison corrections (Bonferroni). These additions will control for model scale and substantiate the claims of distinct robustness profiles. revision: yes

  3. Referee: [§3.1] §3.1 (perturbation design) and §5 (explanation of failure patterns): character/word/sentence edits to docstrings are used to induce deformations, but no evidence is given that these artificial edits explain more variance than simpler baselines (e.g., prompt length or lexical diversity). The manuscript also lacks any comparison to naturalistic variations (altered constraints, missing edge cases) that occur in real code tasks, weakening the assertion that the chosen perturbations adequately represent realistic input fragility.

    Authors: We will revise §3.1 to include regression-based comparisons showing the additional variance explained by our structured perturbations over simpler baselines such as prompt length and lexical diversity. For naturalistic variations, we will add a discussion in §5 relating our controlled findings to real-world prompt changes observed in code repositories and developer forums. A limited comparison using naturally varied prompts from BigCodeBench will be included where data permits; otherwise, we will explicitly frame the absence of a full naturalistic study as a limitation and future direction. revision: partial
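The multiple-comparison control committed to in response 2 is the simplest of the standard corrections. A minimal sketch:

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Bonferroni correction: with m comparisons, reject H0_i only
    when p_i < alpha / m, bounding the family-wise error rate at alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```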

Circularity Check

0 steps flagged

No circularity detected; empirical claims rest on independent observations

Full rationale

The paper conducts a controlled empirical study by applying systematic perturbations to task docstrings, instrumenting token-level uncertainty in generation traces, defining three structural anchors from those traces, and reporting observed trajectory deformations (Lengthening, Branching, Simplification) and their correlations with failure modes. No equations, fitted parameters, or predictions appear that reduce by construction to the inputs; the anchors are introduced as novel definitions rather than derived quantities, and results are presented as contingent empirical findings across models and benchmarks rather than forced by self-definition or self-citation chains. The evidential chain is therefore grounded in external benchmarks rather than in self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims depend on domain assumptions about what counts as realistic perturbations and reliable uncertainty signals, plus the novel but unproven framing of structural anchors as load-bearing for robustness.

axioms (2)
  • domain assumption Character-, word-, and sentence-level perturbations to docstrings represent realistic input variations in code generation tasks.
    Invoked to justify the experimental design as relevant to practice.
  • domain assumption Token-level uncertainty measurements serve as a valid diagnostic for localizing instability around structural anchors.
    Used to support the claim that early-stage uncertainty predicts trajectory deformations.
invented entities (1)
  • Structural anchors (reasoning-code transition, symbolic commitment, algorithmic articulation) · no independent evidence
    purpose: To identify sensitive commitment points where perturbations cause reasoning fragility.
    Newly defined concepts introduced to explain observed failure patterns.

pith-pipeline@v0.9.0 · 5605 in / 1375 out tokens · 90822 ms · 2026-05-10T15:13:59.241398+00:00 · methodology

discussion (0)

