Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3
The pith
CoT for code generation stays robust only when perturbations spare three key structural commitment points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoT prompting does not provide uniform gains in performance or robustness for code generation tasks. Instead, its effectiveness depends on whether input perturbations destabilize structurally sensitive commitment points in the trajectory from reasoning to code output. The study identifies three anchors at which this occurs and shows that their disruption produces characteristic deformations in the generation trace, which can be detected through token-level uncertainty measurements.
What carries the argument
Three structural anchors—reasoning-code transition, symbolic commitment, and algorithmic articulation—that mark commitment points in the reasoning trajectory where perturbations can trigger specific deformations and failures.
If this is right
- CoT benefits are not uniform but contingent on model family, task structure, and prompt explicitness.
- CoT and No-CoT prompting exhibit distinct robustness profiles, with different perturbation families triggering different failure modes.
- Three recurrent trajectory deformations—lengthening, branching, and simplification—systematically emerge when perturbations interact with the structural anchors.
- Early-stage uncertainty serves as a reliable diagnostic signal for localizing where trajectory instability begins around sensitive anchors.
Where Pith is reading between the lines
- Generation systems could monitor uncertainty around these anchors in real time to decide whether to use CoT or fall back to direct prompting.
- The framework of structural anchors and trajectory deformations could be tested on non-code reasoning tasks such as mathematical proofs or logical deduction.
- Prompt design might deliberately reinforce the three anchors to reduce sensitivity to realistic input noise.
- New robustness benchmarks could be built by systematically targeting these commitment points rather than applying random changes.
Load-bearing premise
The three structural anchors capture the primary points of fragility, and the chosen character-, word-, and sentence-level perturbations adequately represent realistic input variations in code generation tasks.
What would settle it
Observing a new set of models or benchmarks where perturbing the identified anchors does not increase failure rates or where early uncertainty spikes fail to precede trajectory deformations would disprove the central claim.
Original abstract
Chain-of-Thought (CoT) prompting is widely used to elicit explicit reasoning from large language models for code (LLM4Code). However, its impact on robustness and the stability of reasoning trajectories under realistic input perturbations remains poorly understood. Prior work has largely evaluated CoT through final correctness, leaving a critical gap in understanding how CoT reshapes internal uncertainty dynamics and why it sometimes harms rather than helps code generation. We suggest that CoT is not uniformly beneficial; instead, its robustness depends on whether perturbations destabilize structurally sensitive commitment points along the reasoning-to-code trajectory. We conduct a controlled, large-scale empirical study of CoT across six models and two code benchmarks (MHPP and BigCodeBench), subjecting task docstrings to systematic character-, word-, and sentence-level perturbations. We instrument full generation traces with token-level uncertainty and define three novel structural anchors: reasoning-code transition, symbolic commitment, and algorithmic articulation. Findings: (1) CoT does not yield uniform performance or robustness gains: its benefits are contingent on model family, task structure, and prompt explicitness. (2) CoT and No-CoT exhibit distinct robustness profiles, with different perturbation families triggering different failure modes. (3) We identify three recurrent trajectory deformations--Lengthening, Branching, and Simplification--that systematically emerge when perturbations interact with structural anchors and explain failure patterns. (4) Early-stage uncertainty serves as a reliable diagnostic signal for localizing where trajectory instability begins around sensitive anchors. These results provide a unified explanation for CoT's mixed performance and suggest design principles for building more robust reasoning-based code generators.
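The abstract's three perturbation families can be illustrated with minimal stand-in operators. The paper's exact edit operations are not specified here, so these helpers (and the synonym map the word-level operator takes) are hypothetical sketches, not the study's implementation:

```python
import random

def char_perturb(text, rng):
    """Character level: swap two adjacent characters (typo-style noise)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def word_perturb(text, rng, synonyms):
    """Word level: replace one word using a provided synonym mapping."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    if candidates:
        i = rng.choice(candidates)
        words[i] = synonyms[words[i]]
    return " ".join(words)

def sentence_perturb(text, rng):
    """Sentence level: reorder sentences while keeping their content."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."
```

Each operator preserves the task's content while varying its surface form, which is the property that lets the study attribute failures to anchor destabilization rather than information loss.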
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a large-scale empirical study of Chain-of-Thought (CoT) prompting for code generation across six LLMs and two benchmarks (MHPP, BigCodeBench). It subjects docstrings to character-, word-, and sentence-level perturbations, instruments token-level uncertainty in full traces, and defines three structural anchors (reasoning-code transition, symbolic commitment, algorithmic articulation). The central claim is that CoT robustness is not uniform but depends on whether perturbations destabilize these anchors, producing recurrent trajectory deformations (Lengthening, Branching, Simplification) that explain mixed performance; early uncertainty is proposed as a diagnostic for instability.
Significance. If the empirical links hold, the work supplies a mechanistic account of why CoT sometimes degrades rather than improves code generation, shifting evaluation from final accuracy to internal trajectory stability. The controlled perturbation design and uncertainty instrumentation are strengths that enable concrete, falsifiable observations on fragility points and could inform more robust prompting strategies in LLM4Code.
major comments (3)
- [§3.2] §3.2 (definition of structural anchors): the three anchors are identified from instrumented traces and then linked to observed deformations, but the manuscript provides no pre-specified, independent validation that these points are primary rather than post-hoc selections that co-occur with failures. Without an a priori test (e.g., predictive power on held-out data or comparison to other candidate points), the claim that robustness is governed by destabilization at exactly these anchors remains under-supported.
- [§4] §4 (experimental results on robustness profiles): the study is described as large-scale yet reports no sample sizes per condition, no statistical tests, confidence intervals, or controls for multiple comparisons when asserting that CoT and No-CoT exhibit distinct profiles and that specific perturbation families trigger different failure modes. This absence makes it impossible to determine whether reported differences exceed noise or confounding factors such as model scale.
- [§3.1] §3.1 (perturbation design) and §5 (explanation of failure patterns): character/word/sentence edits to docstrings are used to induce deformations, but no evidence is given that these artificial edits explain more variance than simpler baselines (e.g., prompt length or lexical diversity). The manuscript also lacks any comparison to naturalistic variations (altered constraints, missing edge cases) that occur in real code tasks, weakening the assertion that the chosen perturbations adequately represent realistic input fragility.
minor comments (3)
- [Abstract] The abstract and §2 use the term 'structural anchors' without an early, compact definition or diagram; readers must reach §3.2 to understand the three specific points.
- [§4] Table or figure captions for the deformation examples (Lengthening, Branching, Simplification) should explicitly state the perturbation type and model that produced each illustrated trace.
- [§2] The related-work section should cite recent studies on uncertainty estimation in LLM reasoning (e.g., token-level entropy methods) to situate the instrumentation approach.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the empirical rigor and generalizability of our claims. We address each major comment point by point below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [§3.2] §3.2 (definition of structural anchors): the three anchors are identified from instrumented traces and then linked to observed deformations, but the manuscript provides no pre-specified, independent validation that these points are primary rather than post-hoc selections that co-occur with failures. Without an a priori test (e.g., predictive power on held-out data or comparison to other candidate points), the claim that robustness is governed by destabilization at exactly these anchors remains under-supported.
Authors: We agree that the structural anchors were derived from systematic observation of the instrumented traces rather than fully pre-specified in advance. In the revision, we will update §3.2 to ground the three anchors in a priori hypotheses drawn from prior literature on LLM reasoning steps and code generation trajectories. We will also add a validation analysis on held-out data demonstrating their predictive power for failure modes, including explicit comparisons against alternative candidate points (e.g., mid-reasoning transitions or output-only steps) to establish their relative primacy. revision: yes
-
Referee: [§4] §4 (experimental results on robustness profiles): the study is described as large-scale yet reports no sample sizes per condition, no statistical tests, confidence intervals, or controls for multiple comparisons when asserting that CoT and No-CoT exhibit distinct profiles and that specific perturbation families trigger different failure modes. This absence makes it impossible to determine whether reported differences exceed noise or confounding factors such as model scale.
Authors: We acknowledge this gap in statistical reporting. The revised §4 will explicitly report sample sizes per condition (tasks per model, benchmark, and perturbation type). We will incorporate appropriate statistical tests (paired t-tests and ANOVA for performance differences; chi-squared tests for failure mode distributions), 95% confidence intervals, and multiple-comparison corrections (Bonferroni). These additions will control for model scale and substantiate the claims of distinct robustness profiles. revision: yes
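As a hedged sketch of the proposed additions (not the authors' actual analysis code), the paired t statistic and Bonferroni correction need only the standard library:

```python
import math
import statistics

def paired_t(x, y):
    """Paired t statistic for matched samples, e.g., CoT vs No-CoT per task."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))

def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: adjusted alpha and per-test significance flags."""
    adjusted = alpha / len(p_values)
    return adjusted, [p < adjusted for p in p_values]
```

The correction simply divides the family-wise significance level by the number of comparisons, which is conservative but appropriate when many perturbation-by-model conditions are tested.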
-
Referee: [§3.1] §3.1 (perturbation design) and §5 (explanation of failure patterns): character/word/sentence edits to docstrings are used to induce deformations, but no evidence is given that these artificial edits explain more variance than simpler baselines (e.g., prompt length or lexical diversity). The manuscript also lacks any comparison to naturalistic variations (altered constraints, missing edge cases) that occur in real code tasks, weakening the assertion that the chosen perturbations adequately represent realistic input fragility.
Authors: We will revise §3.1 to include regression-based comparisons showing the additional variance explained by our structured perturbations over simpler baselines such as prompt length and lexical diversity. For naturalistic variations, we will add a discussion in §5 relating our controlled findings to real-world prompt changes observed in code repositories and developer forums. A limited comparison using naturally varied prompts from BigCodeBench will be included where data permits; otherwise, we will explicitly frame the absence of a full naturalistic study as a limitation and future direction. revision: partial
Circularity Check
No circularity detected; empirical claims rest on independent observations
Full rationale
The paper conducts a controlled empirical study by applying systematic perturbations to task docstrings, instrumenting token-level uncertainty in generation traces, defining three structural anchors from those traces, and reporting observed trajectory deformations (Lengthening, Branching, Simplification) and their correlations with failure modes. No equations, fitted parameters, or predictions appear that reduce by construction to the inputs; the anchors are introduced as novel definitions rather than derived quantities, and results are presented as contingent empirical findings across models and benchmarks rather than forced by self-definition or self-citation chains. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Character-, word-, and sentence-level perturbations to docstrings represent realistic input variations in code generation tasks.
- domain assumption: Token-level uncertainty measurements serve as a valid diagnostic for localizing instability around structural anchors.
invented entities (1)
- Structural anchors (reasoning-code transition, symbolic commitment, algorithmic articulation): no independent evidence
Reference graph
Works this paper leans on
- [1] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., "Code Llama: Open foundation models for code," arXiv preprint arXiv:2308.12950, 2023.
- [2] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li et al., "DeepSeek-Coder: When the large language model meets programming -- the rise of code intelligence," arXiv preprint arXiv:2401.14196, 2024.
- [3] B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang et al., "Qwen2.5-Coder technical report," CoRR, 2024.
- [4] Y. Liu, A. Foundjem, F. Khomh, and H. Li, "Adversarial attack classification and robustness testing for large language models for code," Empirical Software Engineering, vol. 30, no. 5, p. 154, 2025.
- [5] S. Wang, Z. Li, H. Qian, C. Yang, Z. Wang, M. Shang, V. Kumar, S. Tan, B. Ray, P. Bhatia et al., "ReCode: Robustness evaluation of code generation models," arXiv preprint arXiv:2212.10264, 2022.
- [6] G. Yang, Y. Zhou, W. Yang, T. Yue, X. Chen, and T. Chen, "How important are good method names in neural code generation? A model robustness perspective," ACM Transactions on Software Engineering and Methodology, vol. 33, no. 3, pp. 1–35, 2024.
- [7] A. Mastropaolo, L. Pascarella, E. Guglielmi, M. Ciniselli, S. Scalabrino, R. Oliveto, and G. Bavota, "On the robustness of code generation techniques: An empirical study on GitHub Copilot," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 2149–2160.
- [8] Y. Zhou, X. Zhang, J. Shen, T. Han, T. Chen, and H. Gall, "Adversarial robustness of deep code comment generation," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 4, pp. 1–30, 2022.
- [9] M. Yan, J. Chen, J. M. Zhang, X. Cao, C. Yang, and M. Harman, "Robustness evaluation of code generation systems via concretizing instructions," Information and Software Technology, vol. 179, p. 107645, 2025.
- [10] L. Zhong and Z. Wang, "Can LLM replace Stack Overflow? A study on robustness and reliability of large language model code generation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 19, 2024, pp. 21841–21849.
- [11] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, "SWE-agent: Agent-computer interfaces enable automated software engineering," Advances in Neural Information Processing Systems, vol. 37, pp. 50528–50652, 2024.
- [12] J. Liu, K. Wang, Y. Chen, X. Peng, Z. Chen, L. Zhang, and Y. Lou, "Large language model-based agents for software engineering: A survey," ACM Transactions on Software Engineering and Methodology, 2024.
- [13] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
- [14] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213, 2022.
- [15] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," arXiv preprint arXiv:2203.11171, 2022.
- [16] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, "Chain-of-verification reduces hallucination in large language models," in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3563–3578.
- [17] D. Huang, Q. Bu, Y. Qing, and H. Cui, "CodeCoT: Tackling code syntax errors in CoT reasoning for code generation," arXiv e-prints, 2023.
- [18] Z. Zhou, R. Tao, J. Zhu, Y. Luo, Z. Wang, and B. Han, "Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales?" Advances in Neural Information Processing Systems, vol. 37, pp. 123846–123910, 2024.
- [19] Z. Yin, Q. Sun, Q. Guo, Z. Zeng, X. Li, J. Dai, Q. Cheng, X.-J. Huang, and X. Qiu, "Reasoning in flux: Enhancing large language models reasoning through uncertainty-aware adaptive guidance," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 2401–2416.
- [20] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," ACM Transactions on Software Engineering and Methodology, vol. 33, no. 8, pp. 1–79, 2024.
- [21] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
- [22] J. Cordeiro, S. Noei, and Y. Zou, "An empirical study on the code refactoring capability of large language models," arXiv preprint arXiv:2411.02320, 2024.
- [23] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al., "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.
- [24] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song et al., "Measuring coding challenge competence with APPS," arXiv preprint arXiv:2105.09938, 2021.
- [25] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- [26] J. Li, G. Li, Y. Li, and Z. Jin, "Structured chain-of-thought prompting for code generation," ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–23, 2025.
- [27] Z. Yu, L. He, Z. Wu, X. Dai, and J. Chen, "Towards better chain-of-thought prompting strategies: A survey," arXiv preprint arXiv:2310.04959, 2023.
- [28] Z. Li, Y. Chang, and Y. Wu, "Think-Bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models," arXiv preprint arXiv:2505.22113, 2025.
- [29] L. Kuhn, Y. Gal, and S. Farquhar, "Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation," arXiv preprint arXiv:2302.09664, 2023.
- [30] O. Shorinwa, Z. Mei, J. Lidard, A. Z. Ren, and A. Majumdar, "A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions," ACM Computing Surveys, 2025.
- [31] Y. Zhu, G. Li, X. Jiang, J. Li, H. Mei, Z. Jin, and Y. Dong, "Uncertainty-guided chain-of-thought for code generation with LLMs," arXiv preprint arXiv:2503.15341, 2025.
- [32] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, "Beyond accuracy: Behavioral testing of NLP models with CheckList," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 4902–4912.
- [33] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, Y. Zhang, N. Gong et al., "PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts," in Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, 2023, pp. 57–68.
- [34] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "CodeBERT: A pre-trained model for programming and natural languages," in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1536–1547.
- [35] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu et al., "GraphCodeBERT: Pre-training code representations with data flow," arXiv preprint arXiv:2009.08366, 2020.
- [36] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "CodeGen: An open large language model for code with multi-turn program synthesis," arXiv preprint arXiv:2203.13474, 2022.
- [37] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., "Qwen technical report," arXiv preprint arXiv:2309.16609, 2023.
- [38] E. Zelikman, Y. Wu, J. Mu, and N. Goodman, "STaR: Bootstrapping reasoning with reasoning," Advances in Neural Information Processing Systems, vol. 35, pp. 15476–15488, 2022.
- [39] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, "Tree of thoughts: Deliberate problem solving with large language models," Advances in Neural Information Processing Systems, vol. 36, pp. 11809–11822, 2023.
- [40] C. Li, J. Liang, A. Zeng, X. Chen, K. Hausman, D. Sadigh, S. Levine, L. Fei-Fei, F. Xia, and B. Ichter, "Chain of code: Reasoning with a language model-augmented code emulator," arXiv preprint arXiv:2312.04474, 2023.
- [41] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson et al., "Language models (mostly) know what they know," arXiv preprint arXiv:2207.05221, 2022.
- [42] A. Jha and C. K. Reddy, "CodeAttack: Code-based adversarial attacks for pre-trained programming language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 14892–14900.
- [43] J. Dai, J. Lu, Y. Feng, G. Zeng, R. Ruan, M. Cheng, D. Huang, H. Tan, and Z. Guo, "MHPP: Exploring the capabilities and limitations of language models beyond basic code generation," arXiv preprint arXiv:2405.11430, 2024.
- [44] T. Y. Zhuo, M. C. Vu, J. Chim, H. Hu, W. Yu, R. Widyasari, I. N. B. Yusuf, H. Zhan, J. He, I. Paul et al., "BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions," arXiv preprint arXiv:2406.15877, 2024.
- [45] J. He, M. Vero, G. Krasnopolska, and M. Vechev, "Instruction tuning for secure code generation," arXiv preprint arXiv:2402.09497, 2024.
- [46] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.
- [47] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 233–240.
- [48] C. Spearman, "The proof and measurement of association between two things," 1961.
- [49] F. Schmid and R. Schmidt, "Multivariate extensions of Spearman's rho and related statistics," Statistics & Probability Letters, vol. 77, no. 4, pp. 407–416, 2007.
- [50] J. Bedő and C. S. Ong, "Multivariate Spearman's ρ for aggregating ranks using copulas," Journal of Machine Learning Research, vol. 17, no. 201, pp. 1–30, 2016.
- [51] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, and D. Zhou, "Large language models can be easily distracted by irrelevant context," in International Conference on Machine Learning. PMLR, 2023, pp. 31210–31227.