An Empirical Study of Security Calibration in Large Language Models for Code

Joanna C. S. Santos; Md. Nafiu Rahman; Mohammed Latif Siddiq

arxiv: 2606.31159 · v1 · pith:JA4B4E3Gnew · submitted 2026-06-30 · 💻 cs.SE · cs.CR · cs.LG

Mohammed Latif Siddiq, Md. Nafiu Rahman, Joanna C. S. Santos

Pith reviewed 2026-07-01 05:01 UTC · model grok-4.3

classification 💻 cs.SE · cs.CR · cs.LG

keywords: security calibration · LLM code generation · overconfidence · functional correctness · vulnerability remediation · model confidence · software security

The pith

LLMs for code are overconfident, with calibration stronger for security outcomes than for functional correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures calibration in three LLMs by checking whether their stated confidence matches the actual security and correctness of the code they generate. It runs the models on self-contained security tasks and on multi-language repository contexts at different temperatures. Results show overconfidence is common, and that the models' confidence tracks security vulnerabilities more accurately than it tracks whether the code will execute correctly. The authors also test automated repair guided by calibration scores and several mitigation approaches, finding only modest gains and new problems in realistic settings.

Core claim

Overconfidence is prevalent across the evaluated LLMs. Functional calibration is consistently worse than security calibration, suggesting that models estimate security outcomes more reliably than functional correctness, potentially because functional correctness depends on complex execution behavior. Architectural gating improves calibration on controlled benchmarks but calibration deteriorates in realistic repository-level settings, increasing the risk of high-confidence vulnerable outputs.

What carries the argument

Security calibration versus functional calibration, quantified as the match between model-reported confidence and ground-truth outcomes on the two benchmark suites.
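
One way to make this comparison concrete is sketched below. This page does not say which calibration metrics the paper reports; expected calibration error (ECE) and the Brier score are assumed here because they are standard choices and their source papers appear in the reference list ([17], [18]). The function names, the bin count, and the sample data are illustrative and not taken from the paper.

```python
from typing import Sequence


def brier_score(conf: Sequence[float], outcome: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, outcome)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], outcome: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: sample-weighted |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, outcome):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / len(conf) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: one confidence value per generated sample, scored
# against two different ground-truth labels.
confidence = [0.9, 0.9, 0.8, 0.8, 0.7, 0.6]
is_secure = [1, 1, 1, 0, 1, 0]       # 1 = no vulnerability found
passes_tests = [1, 0, 0, 0, 1, 0]    # 1 = functionally correct

security_ece = expected_calibration_error(confidence, is_secure)
functional_ece = expected_calibration_error(confidence, passes_tests)
print(f"security ECE={security_ece:.3f}  functional ECE={functional_ece:.3f}")
```

A lower value means confidence tracks the outcome more closely. In this toy data the functional ECE is the larger of the two, which is the direction of the gap the paper reports.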

If this is right

Calibration-guided repair produces only limited vulnerability fixes and frequently adds functional regressions.
Architectural gating reduces false trust on controlled tasks but raises the rate of high-confidence vulnerable code in repository-level contexts.
Models appear to estimate security outcomes more reliably than they estimate whether generated code will run correctly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Developers using LLM confidence scores to decide whether to accept generated code may be accepting more functional risk than security risk.
Calibration differences could be used to decide when to apply extra static analysis or testing before deployment.
Training objectives that directly target functional execution traces might narrow the observed gap between the two calibration types.
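
The second extension above could be sketched as a small triage rule. This is a hypothetical policy, not something the paper proposes: the function name, the two thresholds, and the step labels are all assumptions. The only idea borrowed from the paper is that functional confidence is the less reliable signal, so it gets the stricter bar here.

```python
from typing import List


def review_plan(
    security_conf: float,
    functional_conf: float,
    security_bar: float = 0.90,     # assumed threshold, not from the paper
    functional_bar: float = 0.97,   # stricter, since functional confidence is less reliable
) -> List[str]:
    """Return the extra checks to run before accepting a generated snippet."""
    steps = []
    if security_conf < security_bar:
        steps.append("static-analysis")
    if functional_conf < functional_bar:
        steps.append("extended-tests")
    # Overconfidence is reported as common, so high confidence never means "no review".
    return steps or ["spot-check"]


print(review_plan(0.95, 0.95))
print(review_plan(0.50, 0.50))
```

The fallback to a spot check reflects the overconfidence finding: a high score lowers the amount of review but does not remove it.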

Load-bearing premise

The two chosen benchmark suites give representative measures of both security vulnerabilities and functional correctness that extend to other code-generation tasks.

What would settle it

A follow-up evaluation on a different collection of security and functional tasks in which functional calibration matches or exceeds security calibration would undermine the central pattern reported.

Figures

Figures reproduced from arXiv: 2606.31159 by Joanna C. S. Santos, Md. Nafiu Rahman, Mohammed Latif Siddiq.

**Figure 1.** Overview of our Study Methodology. Accompanying text as extracted (truncated at both ends): “… CWE categories and enables precise, automated evaluation of both functional correctness and security. 2) Models & Parameters: Each of the 100 prompts from the SALLM benchmark is provided as input to three models that represent the current landscape of frontier and open-weight architectures [41]: GPT-4o-mini [42], an optimized OpenAI model designed for low-latency reasoning …”

Abstract

Large Language Models (LLMs) are rapidly transforming software development, yet their use in security-critical contexts raises a key question: do models know when their generated code is insecure? This property, known as calibration, measures whether a model's confidence aligns with the true correctness of its outputs. We present the first large-scale empirical study of security calibration in LLM-generated code. We evaluate GPT-4o-mini, Gemini-2.0-Flash, and Qwen3-Coder-Next across multiple temperature settings on two complementary benchmarks: self-contained security tasks and multi-language repository-level contexts. Our results suggest that overconfidence is prevalent across the evaluated LLMs. Functional calibration is consistently worse than security calibration, suggesting that models estimate security outcomes more reliably than functional correctness, potentially because functional correctness depends on complex execution behavior. We also examine whether calibration-guided automated repair can help remediate vulnerabilities in LLM-generated code, finding only limited improvements while frequently introducing functional regressions. Moreover, we study different mitigation strategies for reducing False Trust, where models assign high confidence to vulnerable code. The results show that although architectural gating improves calibration on controlled benchmarks, calibration deteriorates in realistic repository-level settings, increasing the risk of high-confidence vulnerable outputs.
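
The False Trust notion in the abstract can be illustrated with a short sketch. The abstract defines it only as high confidence assigned to vulnerable code; the 0.8 cutoff, the choice of denominator (all samples), and the names below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    confidence: float  # model-reported confidence that its code is secure, in [0, 1]
    vulnerable: bool   # ground-truth label from the benchmark's security check


def false_trust_rate(samples: List[Sample], threshold: float = 0.8) -> float:
    """Share of all samples that are vulnerable yet rated at or above `threshold`."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if s.vulnerable and s.confidence >= threshold)
    return hits / len(samples)


# Hypothetical outputs from one model at one temperature.
outputs = [
    Sample(0.90, True),   # false trust: confident and vulnerable
    Sample(0.95, False),  # confident and secure
    Sample(0.50, True),   # vulnerable, but the model was unsure
    Sample(0.85, True),   # false trust
]
print(false_trust_rate(outputs))
```

Dividing by the number of vulnerable samples instead would answer a different question: how often a vulnerability comes with high confidence.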

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core finding is that code LLMs are overconfident overall but calibrate security better than functional correctness, with gating and repair showing limited real-world gains.

The main point is that this is the first large empirical check on whether code-generating LLMs know when their outputs are insecure. They test GPT-4o-mini, Gemini-2.0-Flash, and Qwen3-Coder-Next at varying temperatures on both simple security tasks and multi-language repo contexts. The results indicate that overconfidence is common, that security calibration holds up better than functional calibration, that repair guided by confidence gives only small fixes while often breaking other behavior, and that gating improves calibration on controlled benchmarks while calibration worsens in realistic repo settings.

What the work does well is run the same models across two benchmark styles and multiple temperatures, then directly compare security versus functional calibration and test two mitigation approaches. The observation that functional correctness is harder to calibrate because it depends on execution behavior is a reasonable hypothesis and matches the data pattern they report.

The load-bearing assumption is that the benchmarks produce accurate labels for both vulnerabilities and functional correctness. Self-contained tasks are straightforward to judge, but repository-level cases across languages raise the risk of noisy or incomplete ground truth, which could inflate or even create the reported calibration gap. The paper also covers only three models, so the prevalence claim is narrow.

This is relevant for anyone building or deploying LLM code tools in security-sensitive settings. The empirical focus on calibration metrics and the practical tests of repair and gating make it worth a referee's time even if the labeling details need more scrutiny in revision.

Referee Report

2 major / 2 minor

Summary. The paper presents the first large-scale empirical study of security calibration in LLM-generated code. It evaluates GPT-4o-mini, Gemini-2.0-Flash, and Qwen3-Coder-Next across temperature settings on two benchmarks (self-contained security tasks and multi-language repository-level contexts), reporting prevalent overconfidence, consistently better security calibration than functional calibration, limited gains from calibration-guided repair (with frequent functional regressions), and deterioration of mitigation strategies like architectural gating when moving from controlled to realistic repository settings.

Significance. If the results hold, the work provides actionable evidence that LLMs tend to be overconfident about insecure code and that security self-assessment is more reliable than functional correctness assessment. This has direct implications for safe deployment of code-generating LLMs in security-critical contexts and motivates further research on calibration-aware generation and repair techniques.

major comments (2)

[Evaluation Setup / Benchmarks] The central claims (prevalent overconfidence; security calibration reliably superior to functional) rest on the accuracy of ground-truth labels for vulnerability presence and functional correctness in the two benchmarks. The abstract and evaluation description provide no details on sample sizes, labeling procedures (automated vs. manual), inter-rater reliability, or error rates; without these, the reported calibration gap could be an artifact of noisy or non-representative labels rather than a model property.
[Repair Experiments] The claim that calibration-guided automated repair yields "only limited improvements" while "frequently introducing functional regressions" requires quantitative support on the magnitude of regressions and the baseline repair success rate. The abstract does not report effect sizes, statistical significance, or controls for temperature and model, making it impossible to judge whether the limited benefit is robust.
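
The quantities this comment asks for can be tallied with a sketch like the one below. It assumes each repair attempt is recorded as before/after labels for vulnerability and test status; the record layout, outcome labels, and data are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class RepairAttempt(NamedTuple):
    vulnerable_before: bool
    vulnerable_after: bool
    passed_before: bool
    passed_after: bool


def classify(a: RepairAttempt) -> str:
    """Label one attempt by whether it fixed the flaw and whether it broke tests."""
    fixed = a.vulnerable_before and not a.vulnerable_after
    regressed = a.passed_before and not a.passed_after
    if fixed and regressed:
        return "fixed_but_regressed"
    if fixed:
        return "clean_fix"
    if regressed:
        return "regression_only"
    return "no_effect"


def outcome_rates(attempts: List[RepairAttempt]) -> Dict[str, float]:
    """Fraction of attempts in each outcome class."""
    if not attempts:
        raise ValueError("no repair attempts to summarize")
    counts = Counter(classify(a) for a in attempts)
    labels = ("clean_fix", "fixed_but_regressed", "regression_only", "no_effect")
    return {label: counts[label] / len(attempts) for label in labels}


# Hypothetical attempts; real data would be grouped by model and temperature.
attempts = [
    RepairAttempt(True, False, True, True),   # vulnerability removed, tests still pass
    RepairAttempt(True, False, True, False),  # vulnerability removed, tests now fail
    RepairAttempt(True, True, True, False),   # still vulnerable, tests now fail
    RepairAttempt(True, True, True, True),    # nothing changed
]
rates = outcome_rates(attempts)
print(rates)
```

Reporting the four classes separately keeps a fix that breaks functionality from being counted as a plain success.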

minor comments (2)

[Abstract] The abstract states results "suggest" overconfidence and a calibration gap; the manuscript should clarify whether these are statistically tested differences or descriptive observations.
[Models] Model names (Qwen3-Coder-Next, Gemini-2.0-Flash) should be accompanied by exact version identifiers and access dates to ensure reproducibility.
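
The first minor comment asks whether the calibration gap is statistically tested. One common way to do that is a paired bootstrap over samples, sketched here for the difference in Brier score. This is an assumed analysis, not the authors' procedure; the function names, resample count, seed, and data are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def brier(conf: Sequence[float], outcome: Sequence[int]) -> float:
    """Mean squared gap between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, outcome)) / len(conf)


def bootstrap_gap_ci(
    conf: Sequence[float],
    secure: Sequence[int],
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for (functional Brier - security Brier).

    Each resample draws the same sample indices for both labels, so the
    comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(conf)
    gaps: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        c = [conf[i] for i in idx]
        gap = brier(c, [correct[i] for i in idx]) - brier(c, [secure[i] for i in idx])
        gaps.append(gap)
    gaps.sort()
    lower = gaps[int(alpha / 2 * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data for one model and temperature.
confidence = [0.9, 0.9, 0.8, 0.8, 0.7, 0.6, 0.9, 0.8]
is_secure = [1, 1, 1, 0, 1, 0, 1, 1]
passes_tests = [1, 0, 0, 0, 1, 0, 0, 1]
low, high = bootstrap_gap_ci(confidence, is_secure, passes_tests)
print(f"95% interval for the Brier gap: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support calling the gap a tested difference and not only a descriptive one.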

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of the work's significance. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Evaluation Setup / Benchmarks] The central claims (prevalent overconfidence; security calibration reliably superior to functional) rest on the accuracy of ground-truth labels for vulnerability presence and functional correctness in the two benchmarks. The abstract and evaluation description provide no details on sample sizes, labeling procedures (automated vs. manual), inter-rater reliability, or error rates; without these, the reported calibration gap could be an artifact of noisy or non-representative labels rather than a model property.

Authors: We agree that the abstract and high-level evaluation description omit explicit details on labeling methodology, which is a valid concern for assessing label quality. The full manuscript describes the two benchmarks and their sources but does not dedicate sufficient space to sample sizes, automated vs. manual procedures, inter-rater metrics, or measured error rates. In the revised version we will add a dedicated subsection in the evaluation setup that reports these elements (including exact sample counts per benchmark, the static analysis tools employed, the size and protocol of any manual verification, and any available reliability statistics). This addition will directly address the possibility of label noise affecting the observed calibration differences.

revision: yes

Referee: [Repair Experiments] The claim that calibration-guided automated repair yields "only limited improvements" while "frequently introducing functional regressions" requires quantitative support on the magnitude of regressions and the baseline repair success rate. The abstract does not report effect sizes, statistical significance, or controls for temperature and model, making it impossible to judge whether the limited benefit is robust.

Authors: We accept that the abstract summarizes the repair outcomes at a high level without the requested quantitative details. The full results section already breaks down repair success and regression rates by model and temperature, but does not include effect sizes or formal significance tests. In the revision we will augment the repair experiment subsection with effect-size calculations, statistical significance results, and explicit confirmation that all comparisons control for temperature and model. These additions will provide the quantitative support needed to evaluate the robustness of the "limited improvements" and "functional regressions" findings.

revision: yes

Circularity Check

0 steps flagged

Empirical measurement study with no derivations or self-referential reductions

The paper is a large-scale empirical evaluation of calibration in three LLMs across two benchmarks, reporting observed overconfidence rates, security-vs-functional differences, and mitigation outcomes. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented as deriving new results; all claims rest on direct experimental measurements. The central findings (prevalent overconfidence; security calibration better than functional) are statistical summaries of the collected data rather than reductions to prior self-citations or input definitions. Benchmark labeling assumptions are external validity concerns, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen benchmarks validly capture security and functional properties; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: The selected benchmarks accurately reflect real-world security vulnerabilities and functional correctness.
The study interprets calibration differences between self-contained tasks and repository-level contexts as meaningful only if the benchmarks are representative.

pith-pipeline@v0.9.1-grok · 5755 in / 1108 out tokens · 35500 ms · 2026-07-01T05:01:43.214142+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 19 canonical work pages · 7 internal anchors

[1]

Security in the age of ai teammates: An empirical study of agentic pull requests on github,

M. L. Siddiq, X. Zhao, V. C. Lopes, B. Casey, and J. C. S. Santos, “Security in the age of ai teammates: An empirical study of agentic pull requests on github,” 2026, under review in Information and Software Technology

2026
[2]

An empirical study of code smells in transformer-based code generation techniques,

M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. S. Santos, “An empirical study of code smells in transformer-based code generation techniques,” in 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM), 2022, pp. 71–82

2022
[3]

Asleep at the keyboard? assessing the security of github copilot’s code contributions,

H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of github copilot’s code contributions,” vol. 68, no. 2, Jan. 2025, p. 96–105. [Online]. Available: https://doi.org/10.1145/3610721

work page doi:10.1145/3610721 2025
[4]

Do users write more insecure code with ai assistants?

N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with ai assistants?” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’23, 2023, p. 2785–2799. [Online]. Available: https://doi.org/10.1145/3576915.3623157

work page doi:10.1145/3576915.3623157 2023
[5]

Lost at c: A user study on the security implications of large language model code assistants,

G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, “Lost at c: A user study on the security implications of large language model code assistants,” in 32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 2205–2222

2023
[6]

Sallm: Security assessment of generated code,

M. L. Siddiq, J. C. da Silva Santos, S. Devareddy, and A. Muller, “Sallm: Security assessment of generated code,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, ser. ASEW ’24, 2024, p. 54–65. [Online]. Available: https://doi.org/10.1145/3691621.3694934

work page doi:10.1145/3691621.3694934 2024
[7]

Truthfulqa: Measuring how models mimic human falsehoods,

S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), 2022, pp. 3214–3252

2022
[8]

Calibration and correctness of language models for code,

C. Spiess, D. Gros, K. S. Pai, M. Pradel, M. R. I. Rabin, A. Alipour, S. Jha, P. Devanbu, and T. Ahmed, “Calibration and correctness of language models for code,” in Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25, 2025, p. 540–552. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00040

work page doi:10.1109/icse55347.2025.00040 2025
[9]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program synthesis with large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

SWE-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world github issues?” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

2024
[12]

A.s.e: A repository-level benchmark for evaluating security in ai-generated code,

K. Lian, B. Wang, L. Zhang, L. Chen, J. Wang, Z. Zhao, Y. Yang, M. Lin, H. Duan, H. Zhao et al., “A.s.e: A repository-level benchmark for evaluating security in ai-generated code,” 2025. [Online]. Available: https://arxiv.org/abs/2508.18106

work page arXiv 2025
[13]

LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks,

S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, and G. Stringhini, “LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks,” in IEEE Symposium on Security and Privacy (SP), 2024, arXiv:2312.12575

work page arXiv 2024
[14]

On the definition of appropriate trust and the tools that come with it,

H. Löfström, “On the definition of appropriate trust and the tools that come with it,” in 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE). IEEE, 2023, pp. 1555–1562

2023
[15]

Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models,

M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y. Chen, D. Kapil et al., “Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models,” Tech. Rep., 2024

2024
[16]

SafeGenBench: A benchmark framework for security vulnerability detection in LLM-generated code,

X. Li, J. Ding, C. Peng, B. Zhao, X. Gao, H. Gao, and X. Gu, “SafeGenBench: A benchmark framework for security vulnerability detection in LLM-generated code,” arXiv preprint arXiv:2506.05692, 2025

work page arXiv 2025
[17]

Obtaining well calibrated probabilities using bayesian binning,

M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using bayesian binning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 29, no. 1, 2015

2015
[18]

Verification of forecasts expressed in terms of probability,

G. W. Brier, “Verification of forecasts expressed in terms of probability,” Monthly Weather Review, vol. 78, no. 1, pp. 1–3, 1950

1950
[19]

GPT-4 Technical Report

OpenAI, “GPT-4 technical report,” OpenAI, Tech. Rep., 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Competition-level code generation with AlphaCode,

Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code gene...

work page doi:10.1126/science.abq1158 2022
[21]

How Secure is Code Generated by ChatGPT?

R. Khoury, A. R. Avila, J. Brunelle, B. M. Couture et al., “How secure is code generated by chatgpt?” arXiv preprint arXiv:2304.09655, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Securityeval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques,

M. L. Siddiq and J. C. S. Santos, “Securityeval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques,” in Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, ser. MSR4P&S 2022, 2022, p. 29–33. [Online]. Available: https://doi.org/10.1145/3549035.3561184

work page doi:10.1145/3549035.3561184 2022
[23]

CWE Top 25 Most Dangerous Software Weaknesses,

MITRE Corporation, “CWE Top 25 Most Dangerous Software Weaknesses,” 2023. [Online]. Available: https://cwe.mitre.org/top25/

2023
[24]

Understanding software vulnerabilities related to architectural security tactics: An empirical investigation of chromium, php and thunderbird,

J. C. Santos, A. Peruma, M. Mirakhorli, M. Galstery, J. V. Vidal, and A. Sejfia, “Understanding software vulnerabilities related to architectural security tactics: An empirical investigation of chromium, php and thunderbird,” in 2017 IEEE International Conference on Software Architecture (ICSA). IEEE, 2017, pp. 69–78

2017
[25]

Codelmsec benchmark: Systematically evaluating and finding security vulnerabilities in black-box code language models,

H. Hajipour, K. Hassler, T. Holz, L. Schönherr, and M. Fritz, “Codelmsec benchmark: Systematically evaluating and finding security vulnerabilities in black-box code language models,” in Second IEEE Conference on Secure and Trustworthy Machine Learning, 2024

2024
[26]

Re(Gex|DoS)eval: Evaluating generated regular expressions and their proneness to dos attacks,

M. L. Siddiq, J. Zhang, L. Roney, and J. C. Santos, “Re(Gex|DoS)eval: Evaluating generated regular expressions and their proneness to dos attacks,” in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, 2024, pp. 52–56

2024
[27]

How can we know when language models know? on the calibration of language models for question answering,

Z. Jiang, J. Araki, H. Ding, and G. Neubig, “How can we know when language models know? on the calibration of language models for question answering,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 962–977, 2021. [Online]. Available: https://aclanthology.org/2021.tacl-1.57/

2021
[28]

On calibration of modern neural networks,

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International conference on machine learning. PMLR, 2017, pp. 1321–1330

2017
[29]

Measuring calibration in deep learning,

J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran, “Measuring calibration in deep learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019

2019
[30]

Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,

J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Tech. Rep. 3, 1999

1999
[31]

Predicting good probabilities with supervised learning,

A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” in Proceedings of the 22nd international conference on Machine learning, 2005, pp. 625–632

2005
[32]

Language Models (Mostly) Know What They Know

S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson et al., “Language models (mostly) know what they know,” arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

L. Kuhn, Y. Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,” arXiv preprint arXiv:2302.09664, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Teaching models to express their uncertainty in words,

S. Lin, J. Hilton, and O. Evans, “Teaching models to express their uncertainty in words,” Transactions on Machine Learning Research, 2022. [Online]. Available: https://openreview.net/forum?id=8s8K2UZGTZ

2022
[35]

Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback,

K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. Manning, “Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., Dec....

2023
[36]

Self-consistency improves chain of thought reasoning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” 2022

2022
[37]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22, 2022

2022
[38]

Calibration of pre-trained transformers,

S. Desai and G. Durrett, “Calibration of pre-trained transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds., Nov. 2020, pp. 295–302. [Online]. Available: https://aclanthology.org/2020.emnlp-main.21/

2020
[39]

Toward trustworthy neural program synthesis,

D. Key, W.-D. Li, and K. Ellis, “Toward trustworthy neural program synthesis,” 2023. [Online]. Available: https://arxiv.org/abs/2210.00848

work page arXiv 2023
[40]

On calibration of pre-trained code models,

Z. Zhou, C. Sha, and X. Peng, “On calibration of pre-trained code models,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639126

work page doi:10.1145/3597503.3639126 2024
[41]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” arXiv preprint arXiv:2305.01210, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

GPT-4o-mini Technical Overview,

OpenAI, “GPT-4o-mini Technical Overview,” https://openai.com, 2024, accessed February 2026

2024
[43]

Gemini Technical Report,

Google DeepMind, “Gemini Technical Report,” https://ai.google.dev, 2024, gemini-2.0-Flash API, accessed February 2026

2024
[44]

Qwen3-coder-next technical report,

Qwen Team, “Qwen3-coder-next technical report,” Tech. Rep., accessed: 2026-02-03. [Online]. Available: https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf

2026
[45]

Breaking the silence: the threats of using llms in software engineering,

J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: the threats of using llms in software engineering,” in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE-NIER’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 102–106. [Online]. Available: htt...

work page doi:10.1145/3639476.3639764 2024
[46]

Classifier calibration with roc-regularized isotonic regression,

E. Berta, F. Bach, and M. Jordan, “Classifier calibration with roc-regularized isotonic regression,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 1972–1980

2024

[1] [1]

Security in the age of ai teammates: An empirical study of agentic pull requests on github,

M. L. Siddiq, X. Zhao, V . C. Lopes, B. Casey, and J. C. S. Santos, “Security in the age of ai teammates: An empirical study of agentic pull requests on github,” 2026, under-review in Information and Software Technology

2026

[2] [2]

An empirical study of code smells in transformer-based code generation techniques,

M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. S. Santos, “An empirical study of code smells in transformer-based code generation techniques,” in2022 IEEE 22nd International Working Con- ference on Source Code Analysis and Manipulation (SCAM), 2022, pp. 71–82

2022

[3] [3]

Asleep at the keyboard? assessing the security of github copilot’s code contributions,

H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of github copilot’s code contributions,” vol. 68, no. 2, Jan. 2025, p. 96–105. [Online]. Available: https://doi.org/10.1145/3610721

work page doi:10.1145/3610721 2025

[4] [4]

Do users write more insecure code with ai assistants?

N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with ai assistants?” inProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’23, 2023, p. 2785–2799. [Online]. Available: https://doi.org/10.1145/3576915.3623157

work page doi:10.1145/3576915.3623157 2023

[5] [5]

Lost at c: A user study on the security implications of large language model code assistants,

G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, “Lost at c: A user study on the security implications of large language model code assistants,” in32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 2205–2222

2023

[6] M. L. Siddiq, J. C. da Silva Santos, S. Devareddy, and A. Muller, “Sallm: Security assessment of generated code,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, ser. ASEW ’24, 2024, pp. 54–65. [Online]. Available: https://doi.org/10.1145/3691621.3694934

[7] S. Lin, J. Hilton, and O. Evans, “TruthfulQA: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 3214–3252

[8] C. Spiess, D. Gros, K. S. Pai, M. Pradel, M. R. I. Rabin, A. Alipour, S. Jha, P. Devanbu, and T. Ahmed, “Calibration and correctness of language models for code,” in Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25, 2025, pp. 540–552. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00040

[9] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her... “Evaluating large language models trained on code,” arXiv, 2021

[10] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program synthesis with large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2108.07732

[11] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

[12] K. Lian, B. Wang, L. Zhang, L. Chen, J. Wang, Z. Zhao, Y. Yang, M. Lin, H. Duan, H. Zhao et al., “A.S.E: A repository-level benchmark for evaluating security in AI-generated code,” 2025. [Online]. Available: https://arxiv.org/abs/2508.18106

[13] S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, and G. Stringhini, “LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks,” in IEEE Symposium on Security and Privacy (SP), 2024, arXiv:2312.12575

[14] H. Löfström, “On the definition of appropriate trust and the tools that come with it,” in 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE). IEEE, 2023, pp. 1555–1562

[15] M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y. Chen, D. Kapil et al., “CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models,” Tech. Rep., 2024

[16] X. Li, J. Ding, C. Peng, B. Zhao, X. Gao, H. Gao, and X. Gu, “SafeGenBench: A benchmark framework for security vulnerability detection in LLM-generated code,” arXiv preprint arXiv:2506.05692, 2025

[17] M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using Bayesian binning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015

[18] G. W. Brier, “Verification of forecasts expressed in terms of probability,” Monthly Weather Review, vol. 78, no. 1, pp. 1–3, 1950

[19] OpenAI, “GPT-4 technical report,” OpenAI, Tech. Rep., 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

[20] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code gene... 2022, doi: 10.1126/science.abq1158

[21] R. Khoury, A. R. Avila, J. Brunelle, B. M. Couture et al., “How secure is code generated by ChatGPT?” arXiv preprint arXiv:2304.09655, 2023

[22] M. L. Siddiq and J. C. S. Santos, “SecurityEval dataset: Mining vulnerability examples to evaluate machine learning-based code generation techniques,” in Proceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, ser. MSR4P&S 2022, 2022, pp. 29–33. [Online]. Available: https://doi.org/10.1145/3549035.3561184

[23] MITRE Corporation, “CWE Top 25 Most Dangerous Software Weaknesses,” 2023. [Online]. Available: https://cwe.mitre.org/top25/

[24] J. C. Santos, A. Peruma, M. Mirakhorli, M. Galstery, J. V. Vidal, and A. Sejfia, “Understanding software vulnerabilities related to architectural security tactics: An empirical investigation of Chromium, PHP and Thunderbird,” in 2017 IEEE International Conference on Software Architecture (ICSA). IEEE, 2017, pp. 69–78

[25] H. Hajipour, K. Hassler, T. Holz, L. Schönherr, and M. Fritz, “CodeLMSec benchmark: Systematically evaluating and finding security vulnerabilities in black-box code language models,” in Second IEEE Conference on Secure and Trustworthy Machine Learning, 2024

[26] M. L. Siddiq, J. Zhang, L. Roney, and J. C. Santos, “Re(Gex|DoS)eval: Evaluating generated regular expressions and their proneness to DoS attacks,” in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, 2024, pp. 52–56

[27] Z. Jiang, J. Araki, H. Ding, and G. Neubig, “How can we know when language models know? On the calibration of language models for question answering,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 962–977, 2021. [Online]. Available: https://aclanthology.org/2021.tacl-1.57/

[28] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 1321–1330

[29] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran, “Measuring calibration in deep learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019

[30] J. Platt et al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Tech. Rep. 3, 1999

[31] A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 625–632

[32] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson et al., “Language models (mostly) know what they know,” arXiv preprint arXiv:2207.05221, 2022

[33] L. Kuhn, Y. Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,” arXiv preprint arXiv:2302.09664, 2023

[34] S. Lin, J. Hilton, and O. Evans, “Teaching models to express their uncertainty in words,” Transactions on Machine Learning Research, 2022. [Online]. Available: https://openreview.net/forum?id=8s8K2UZGTZ

[35] K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. Manning, “Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., Dec....

[36] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” 2022

[37] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22, 2022

[38] S. Desai and G. Durrett, “Calibration of pre-trained transformers,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds., Nov. 2020, pp. 295–302. [Online]. Available: https://aclanthology.org/2020.emnlp-main.21/

[39] D. Key, W.-D. Li, and K. Ellis, “Toward trustworthy neural program synthesis,” 2023. [Online]. Available: https://arxiv.org/abs/2210.00848

[40] Z. Zhou, C. Sha, and X. Peng, “On calibration of pre-trained code models,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639126

[41] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation,” arXiv preprint arXiv:2305.01210, 2023

[42] OpenAI, “GPT-4o-mini Technical Overview,” https://openai.com, 2024, accessed February 2026

[43] Google DeepMind, “Gemini Technical Report,” https://ai.google.dev, 2024, gemini-2.0-Flash API, accessed February 2026

[44] Qwen Team, “Qwen3-coder-next technical report,” Tech. Rep., 2026, accessed: 2026-02-03. [Online]. Available: https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf

[45] J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: The threats of using LLMs in software engineering,” in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE-NIER ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 102–106. [Online]. Available: htt... doi:10.1145/3639476.3639764

[46] E. Berta, F. Bach, and M. Jordan, “Classifier calibration with ROC-regularized isotonic regression,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 1972–1980