pith. machine review for the scientific record.

arxiv: 2604.12214 · v1 · submitted 2026-04-14 · 💻 cs.SE

Recognition: unknown

Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3

classification 💻 cs.SE
keywords Chain-of-Thought · LLM4Code · robustness · structural anchors · reasoning trajectories · perturbations · uncertainty · code generation

The pith

CoT for code generation stays robust only when perturbations spare three key structural commitment points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how Chain-of-Thought prompting affects the stability of large language models when generating code from perturbed inputs. It proposes that CoT helps only when changes do not disrupt specific points where the model commits to the reasoning structures that lead to code. In large-scale tests across six models and two benchmarks, the authors apply character-, word-, and sentence-level perturbations to task descriptions and track token-level uncertainty throughout generation. They find that CoT and non-CoT approaches show different vulnerabilities, with failures often linked to three types of changes in the reasoning path. Early detection of rising uncertainty can pinpoint where these disruptions start.
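As a concrete illustration, the three perturbation families can be sketched with toy operators. The specific edits below (adjacent-character swap, word deletion, dropping a sentence) are assumptions for illustration, not the operators the study actually used.

```python
import random

def perturb_docstring(docstring: str, level: str, seed: int = 0) -> str:
    """Apply one perturbation at the given granularity.

    Illustrative operators only; the paper's concrete edit
    operators may differ.
    """
    rng = random.Random(seed)
    if level == "char":
        # character-level: swap two adjacent characters
        chars = list(docstring)
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)
    if level == "word":
        # word-level: delete one randomly chosen word
        words = docstring.split()
        words.pop(rng.randrange(len(words)))
        return " ".join(words)
    if level == "sentence":
        # sentence-level: drop the final sentence
        sentences = docstring.split(". ")
        return ". ".join(sentences[:-1]) + "."
    raise ValueError(f"unknown perturbation level: {level}")
```

Each operator preserves most of the task description while changing exactly one unit at its granularity, which is what lets the study attribute downstream instability to a single controlled edit.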

Core claim

CoT prompting does not provide uniform gains in performance or robustness for code generation tasks. Instead, its effectiveness depends on whether input perturbations destabilize structurally sensitive commitment points in the trajectory from reasoning to code output. The study identifies three anchors at which this occurs and shows that their disruption produces characteristic deformations in the generation trace, which can be detected through token-level uncertainty measurements.
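The token-level uncertainty measurements the claim leans on are standard quantities over the model's next-token distribution. A minimal sketch of the two signals named in Figure 1 (entropy and probability differential), assuming per-step probabilities are available:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def probability_differential(probs: list[float]) -> float:
    """Gap between the top-1 and top-2 token probabilities; a small
    gap means the model is torn between competing continuations."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]
```

High entropy and a small top-1/top-2 gap both indicate a token position where the generation could plausibly have gone another way, which is the raw signal behind the spike analysis.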

What carries the argument

Three structural anchors—reasoning-code transition, symbolic commitment, and algorithmic articulation—marking commitment points in the reasoning trajectory where perturbations can trigger specific deformations and failures.

If this is right

  • CoT benefits are not uniform but contingent on model family, task structure, and prompt explicitness.
  • CoT and No-CoT prompting exhibit distinct robustness profiles, with different perturbation families triggering different failure modes.
  • Three recurrent trajectory deformations—lengthening, branching, and simplification—systematically emerge when perturbations interact with the structural anchors.
  • Early-stage uncertainty serves as a reliable diagnostic signal for localizing where trajectory instability begins around sensitive anchors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Generation systems could monitor uncertainty around these anchors in real time to decide whether to use CoT or fall back to direct prompting.
  • The framework of structural anchors and trajectory deformations could be tested on non-code reasoning tasks such as mathematical proofs or logical deduction.
  • Prompt design might deliberately reinforce the three anchors to reduce sensitivity to realistic input noise.
  • New robustness benchmarks could be built by systematically targeting these commitment points rather than applying random changes.
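The first of these extensions could take the form of a simple runtime policy. The threshold and window values below are hypothetical, not drawn from the paper:

```python
def choose_prompting(head_entropies: list[float],
                     spike_threshold: float = 2.5,
                     window: int = 16) -> str:
    """Hypothetical policy: fall back to direct (No-CoT) prompting
    when uncertainty spikes within the first `window` CoT tokens;
    otherwise keep the CoT trajectory."""
    if any(h >= spike_threshold for h in head_entropies[:window]):
        return "direct"
    return "cot"
```

In practice such a policy would need the early tokens of a CoT attempt before deciding, so the fallback trades a small amount of wasted generation for avoiding a destabilized trajectory.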

Load-bearing premise

The three structural anchors capture the primary points of fragility, and the chosen character-, word-, and sentence-level perturbations adequately represent realistic input variations in code generation tasks.

What would settle it

The central claim would be disproved by a new set of models or benchmarks in which perturbing the identified anchors does not increase failure rates, or in which early uncertainty spikes fail to precede trajectory deformations.

Figures

Figures reproduced from arXiv: 2604.12214 by Armstrong Foundjem, Da Song, Foutse Khomh, Heng Li, Yang Liu.

Figure 1: Methodology overview and research questions (RQs) mapping. Steps are labeled A–F. We evaluate CoT and No-CoT prompting on MHPP and BigCodeBench under controlled character (C), word (W), and sentence (S) perturbations. Execution-based metrics (Pass@k, RD) address RQ1 (performance impact of CoT) and RQ2 (robustness under perturbations). Token-level uncertainty signals (entropy, probability differential) supp…

Figure 2: Distribution of the first uncertainty spike position (token scale). The x-axis denotes the normalized position of the first major uncertainty spike S/T along the generated trajectory, aggregated across all models, datasets, temperatures, and perturbation conditions. Under CoT, spikes are broadly distributed across the trajectory with a long tail toward later tokens, whereas under No-CoT they are heavily fr…

Figure 3: Alignment between the first uncertainty spike and structural anchors. Each panel shows the distribution of normalized distances ∆ = (S − A_k)/T between the first uncertainty spike S and a structural anchor A_k. Vertical dashed lines indicate the anchor location (∆ = 0). Negative values correspond to spikes occurring before the anchor, and positive values to spikes occurring after. Under CoT, uncertainty spikes…

Figure 4: Association between structural anchors and deformation patterns under CoT. Each cell reports the mean magnitude of the first uncertainty spike associated with a given deformation pattern at a specific anchor. Higher values indicate stronger uncertainty concentration. Lengthening is most strongly associated with early instability near the reasoning–code transition (A1), while branching shows elevated uncert…
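The alignment statistic in Figure 3 follows directly from the spike and anchor token indices. A sketch of the caption's convention:

```python
def anchor_alignment(spike: int, anchor: int, trace_len: int) -> float:
    """Normalized distance ∆ = (S − A_k) / T between the first
    uncertainty spike S and a structural anchor A_k over a trace of
    T tokens. Negative: the spike precedes the anchor; positive:
    it follows the anchor."""
    return (spike - anchor) / trace_len
```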
Original abstract

Chain-of-Thought (CoT) prompting is widely used to elicit explicit reasoning from large language models for code (LLM4Code). However, its impact on robustness and the stability of reasoning trajectories under realistic input perturbations remains poorly understood. Prior work has largely evaluated CoT through final correctness, leaving a critical gap in understanding how CoT reshapes internal uncertainty dynamics and why it sometimes harms rather than helps code generation. We suggest that CoT is not uniformly beneficial; instead, its robustness depends on whether perturbations destabilize structurally sensitive commitment points along the reasoning-to-code trajectory. We conduct a controlled, large-scale empirical study of CoT across six models and two code benchmarks (MHPP and BigCodeBench), subjecting task docstrings to systematic character-, word-, and sentence-level perturbations. We instrument full generation traces with token-level uncertainty and define three novel structural anchors: reasoning-code transition, symbolic commitment, and algorithmic articulation. Findings: (1) CoT does not yield uniform performance or robustness gains: its benefits are contingent on model family, task structure, and prompt explicitness. (2) CoT and No-CoT exhibit distinct robustness profiles, with different perturbation families triggering different failure modes. (3) We identify three recurrent trajectory deformations--Lengthening, Branching, and Simplification--that systematically emerge when perturbations interact with structural anchors and explain failure patterns. (4) Early-stage uncertainty serves as a reliable diagnostic signal for localizing where trajectory instability begins around sensitive anchors. These results provide a unified explanation for CoT's mixed performance and suggest design principles for building more robust reasoning-based code generators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper conducts a large-scale empirical study of Chain-of-Thought (CoT) prompting for code generation across six LLMs and two benchmarks (MHPP, BigCodeBench). It subjects docstrings to character-, word-, and sentence-level perturbations, instruments token-level uncertainty in full traces, and defines three structural anchors (reasoning-code transition, symbolic commitment, algorithmic articulation). The central claim is that CoT robustness is not uniform but depends on whether perturbations destabilize these anchors, producing recurrent trajectory deformations (Lengthening, Branching, Simplification) that explain mixed performance; early uncertainty is proposed as a diagnostic for instability.

Significance. If the empirical links hold, the work supplies a mechanistic account of why CoT sometimes degrades rather than improves code generation, shifting evaluation from final accuracy to internal trajectory stability. The controlled perturbation design and uncertainty instrumentation are strengths that enable concrete, falsifiable observations on fragility points and could inform more robust prompting strategies in LLM4Code.

major comments (3)
  1. [§3.2] §3.2 (definition of structural anchors): the three anchors are identified from instrumented traces and then linked to observed deformations, but the manuscript provides no pre-specified, independent validation that these points are primary rather than post-hoc selections that co-occur with failures. Without an a priori test (e.g., predictive power on held-out data or comparison to other candidate points), the claim that robustness is governed by destabilization at exactly these anchors remains under-supported.
  2. [§4] §4 (experimental results on robustness profiles): the study is described as large-scale yet reports no sample sizes per condition, no statistical tests, confidence intervals, or controls for multiple comparisons when asserting that CoT and No-CoT exhibit distinct profiles and that specific perturbation families trigger different failure modes. This absence makes it impossible to determine whether reported differences exceed noise or confounding factors such as model scale.
  3. [§3.1] §3.1 (perturbation design) and §5 (explanation of failure patterns): character/word/sentence edits to docstrings are used to induce deformations, but no evidence is given that these artificial edits explain more variance than simpler baselines (e.g., prompt length or lexical diversity). The manuscript also lacks any comparison to naturalistic variations (altered constraints, missing edge cases) that occur in real code tasks, weakening the assertion that the chosen perturbations adequately represent realistic input fragility.
minor comments (3)
  1. [Abstract] The abstract and §2 use the term 'structural anchors' without an early, compact definition or diagram; readers must reach §3.2 to understand the three specific points.
  2. [§4] Table or figure captions for the deformation examples (Lengthening, Branching, Simplification) should explicitly state the perturbation type and model that produced each illustrated trace.
  3. [§2] The related-work section should cite recent studies on uncertainty estimation in LLM reasoning (e.g., token-level entropy methods) to situate the instrumentation approach.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical rigor and generalizability of our claims. We address each major comment point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (definition of structural anchors): the three anchors are identified from instrumented traces and then linked to observed deformations, but the manuscript provides no pre-specified, independent validation that these points are primary rather than post-hoc selections that co-occur with failures. Without an a priori test (e.g., predictive power on held-out data or comparison to other candidate points), the claim that robustness is governed by destabilization at exactly these anchors remains under-supported.

    Authors: We agree that the structural anchors were derived from systematic observation of the instrumented traces rather than fully pre-specified in advance. In the revision, we will update §3.2 to ground the three anchors in a priori hypotheses drawn from prior literature on LLM reasoning steps and code generation trajectories. We will also add a validation analysis on held-out data demonstrating their predictive power for failure modes, including explicit comparisons against alternative candidate points (e.g., mid-reasoning transitions or output-only steps) to establish their relative primacy. revision: yes

  2. Referee: [§4] §4 (experimental results on robustness profiles): the study is described as large-scale yet reports no sample sizes per condition, no statistical tests, confidence intervals, or controls for multiple comparisons when asserting that CoT and No-CoT exhibit distinct profiles and that specific perturbation families trigger different failure modes. This absence makes it impossible to determine whether reported differences exceed noise or confounding factors such as model scale.

    Authors: We acknowledge this gap in statistical reporting. The revised §4 will explicitly report sample sizes per condition (tasks per model, benchmark, and perturbation type). We will incorporate appropriate statistical tests (paired t-tests and ANOVA for performance differences; chi-squared tests for failure mode distributions), 95% confidence intervals, and multiple-comparison corrections (Bonferroni). These additions will control for model scale and substantiate the claims of distinct robustness profiles. revision: yes

  3. Referee: [§3.1] §3.1 (perturbation design) and §5 (explanation of failure patterns): character/word/sentence edits to docstrings are used to induce deformations, but no evidence is given that these artificial edits explain more variance than simpler baselines (e.g., prompt length or lexical diversity). The manuscript also lacks any comparison to naturalistic variations (altered constraints, missing edge cases) that occur in real code tasks, weakening the assertion that the chosen perturbations adequately represent realistic input fragility.

    Authors: We will revise §3.1 to include regression-based comparisons showing the additional variance explained by our structured perturbations over simpler baselines such as prompt length and lexical diversity. For naturalistic variations, we will add a discussion in §5 relating our controlled findings to real-world prompt changes observed in code repositories and developer forums. A limited comparison using naturally varied prompts from BigCodeBench will be included where data permits; otherwise, we will explicitly frame the absence of a full naturalistic study as a limitation and future direction. revision: partial
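The multiple-comparison control committed to in response 2 is the simplest of the standard corrections. A minimal sketch:

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Bonferroni correction: with m comparisons, reject H0_i only
    when p_i < alpha / m, bounding the family-wise error rate at alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```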

Circularity Check

0 steps flagged

No circularity detected; empirical claims rest on independent observations

Full rationale

The paper conducts a controlled empirical study by applying systematic perturbations to task docstrings, instrumenting token-level uncertainty in generation traces, defining three structural anchors from those traces, and reporting observed trajectory deformations (Lengthening, Branching, Simplification) and their correlations with failure modes. No equations, fitted parameters, or predictions appear that reduce by construction to the inputs; the anchors are introduced as novel definitions rather than derived quantities, and results are presented as contingent empirical findings across models and benchmarks rather than forced by self-definition or self-citation chains. The evidential chain is therefore grounded in external benchmarks rather than in self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims depend on domain assumptions about what counts as realistic perturbations and reliable uncertainty signals, plus the novel but unproven framing of structural anchors as load-bearing for robustness.

axioms (2)
  • domain assumption Character-, word-, and sentence-level perturbations to docstrings represent realistic input variations in code generation tasks.
    Invoked to justify the experimental design as relevant to practice.
  • domain assumption Token-level uncertainty measurements serve as a valid diagnostic for localizing instability around structural anchors.
    Used to support the claim that early-stage uncertainty predicts trajectory deformations.
invented entities (1)
  • Structural anchors (reasoning-code transition, symbolic commitment, algorithmic articulation) · no independent evidence
    purpose: To identify sensitive commitment points where perturbations cause reasoning fragility.
    Newly defined concepts introduced to explain observed failure patterns.

pith-pipeline@v0.9.0 · 5605 in / 1375 out tokens · 90822 ms · 2026-05-10T15:13:59.241398+00:00 · methodology

discussion (0)

