pith. sign in

arxiv: 2606.31159 · v1 · pith:JA4B4E3Gnew · submitted 2026-06-30 · 💻 cs.SE · cs.CR· cs.LG

An Empirical Study of Security Calibration in Large Language Models for Code

Pith reviewed 2026-07-01 05:01 UTC · model grok-4.3

classification 💻 cs.SE cs.CRcs.LG
keywords security calibrationLLM code generationoverconfidencefunctional correctnessvulnerability remediationmodel confidencesoftware security
0
0 comments X

The pith

LLMs for code are overconfident, with calibration stronger for security outcomes than for functional correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures calibration in three LLMs by checking whether their stated confidence matches the actual security and correctness of the code they generate. It runs the models on self-contained security tasks and on multi-language repository contexts at different temperatures. Results show overconfidence is common, and that the models' confidence tracks security vulnerabilities more accurately than it tracks whether the code will execute correctly. The authors also test automated repair guided by calibration scores and several mitigation approaches, finding only modest gains and new problems in realistic settings.

Core claim

Overconfidence is prevalent across the evaluated LLMs. Functional calibration is consistently worse than security calibration, suggesting that models estimate security outcomes more reliably than functional correctness, potentially because functional correctness depends on complex execution behavior. Architectural gating improves calibration on controlled benchmarks but calibration deteriorates in realistic repository-level settings, increasing the risk of high-confidence vulnerable outputs.

What carries the argument

Security calibration versus functional calibration, quantified as the match between model-reported confidence and ground-truth outcomes on the two benchmark suites.

If this is right

  • Calibration-guided repair produces only limited vulnerability fixes and frequently adds functional regressions.
  • Architectural gating reduces false trust on controlled tasks but raises the rate of high-confidence vulnerable code in repository-level contexts.
  • Models appear to estimate security outcomes more reliably than they estimate whether generated code will run correctly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers using LLM confidence scores to decide whether to accept generated code may be accepting more functional risk than security risk.
  • Calibration differences could be used to decide when to apply extra static analysis or testing before deployment.
  • Training objectives that directly target functional execution traces might narrow the observed gap between the two calibration types.

Load-bearing premise

The two chosen benchmark suites give representative measures of both security vulnerabilities and functional correctness that extend to other code-generation tasks.

What would settle it

A follow-up evaluation on a different collection of security and functional tasks in which functional calibration matches or exceeds security calibration would undermine the central pattern reported.

Figures

Figures reproduced from arXiv: 2606.31159 by Joanna C. S. Santos, Md. Nafiu Rahman, Mohammed Latif Siddiq.

Figure 1
Figure 1. Figure 1: Overview of our Study Methodology CWE categories and enables precise, automated evaluation of both functional correctness and security. 2) Models & Parameters: Each of the 100 prompts from the SALLM benchmark is provided as input to three models that represent the current landscape of frontier and open-weight architectures [41] GPT-4o-mini [42], an optimized OpenAI model designed for low-latency reasoning … view at source ↗
read the original abstract

Large Language Models (LLMs) are rapidly transforming software development, yet their use in security-critical contexts raises a key question: do models know when their generated code is insecure? This property, known as calibration, measures whether a model's confidence aligns with the true correctness of its outputs. We present the first large-scale empirical study of security calibration in LLM-generated code. We evaluate GPT-4o-mini, Gemini-2.0-Flash, and Qwen3-Coder-Next across multiple temperature settings on two complementary benchmarks: self-contained security tasks and multi-language repository-level contexts. Our results suggest that overconfidence is prevalent across the evaluated LLMs. Functional calibration is consistently worse than security calibration, suggesting that models estimate security outcomes more reliably than functional correctness, potentially because functional correctness depends on complex execution behavior. We also examine whether calibration-guided automated repair can help remediate vulnerabilities in LLM-generated code, finding only limited improvements while frequently introducing functional regressions. Moreover, we study different mitigation strategies for reducing False Trust, where models assign high confidence to vulnerable code. The results show that although architectural gating improves calibration on controlled benchmarks, calibration deteriorates in realistic repository-level settings, increasing the risk of high-confidence vulnerable outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first large-scale empirical study of security calibration in LLM-generated code. It evaluates GPT-4o-mini, Gemini-2.0-Flash, and Qwen3-Coder-Next across temperature settings on two benchmarks (self-contained security tasks and multi-language repository-level contexts), reporting prevalent overconfidence, consistently better security calibration than functional calibration, limited gains from calibration-guided repair (with frequent functional regressions), and deterioration of mitigation strategies like architectural gating when moving from controlled to realistic repository settings.

Significance. If the results hold, the work provides actionable evidence that LLMs tend to be overconfident about insecure code and that security self-assessment is more reliable than functional correctness assessment. This has direct implications for safe deployment of code-generating LLMs in security-critical contexts and motivates further research on calibration-aware generation and repair techniques.

major comments (2)
  1. [Evaluation Setup / Benchmarks] The central claims (prevalent overconfidence; security calibration reliably superior to functional) rest on the accuracy of ground-truth labels for vulnerability presence and functional correctness in the two benchmarks. The abstract and evaluation description provide no details on sample sizes, labeling procedures (automated vs. manual), inter-rater reliability, or error rates; without these, the reported calibration gap could be an artifact of noisy or non-representative labels rather than a model property.
  2. [Repair Experiments] The claim that calibration-guided automated repair yields "only limited improvements" while "frequently introducing functional regressions" requires quantitative support on the magnitude of regressions and the baseline repair success rate. The abstract does not report effect sizes, statistical significance, or controls for temperature and model, making it impossible to judge whether the limited benefit is robust.
minor comments (2)
  1. [Abstract] The abstract states results "suggest" overconfidence and a calibration gap; the manuscript should clarify whether these are statistically tested differences or descriptive observations.
  2. [Models] Model names (Qwen3-Coder-Next, Gemini-2.0-Flash) should be accompanied by exact version identifiers and access dates to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of the work's significance. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation Setup / Benchmarks] The central claims (prevalent overconfidence; security calibration reliably superior to functional) rest on the accuracy of ground-truth labels for vulnerability presence and functional correctness in the two benchmarks. The abstract and evaluation description provide no details on sample sizes, labeling procedures (automated vs. manual), inter-rater reliability, or error rates; without these, the reported calibration gap could be an artifact of noisy or non-representative labels rather than a model property.

    Authors: We agree that the abstract and high-level evaluation description omit explicit details on labeling methodology, which is a valid concern for assessing label quality. The full manuscript describes the two benchmarks and their sources but does not dedicate sufficient space to sample sizes, automated vs. manual procedures, inter-rater metrics, or measured error rates. In the revised version we will add a dedicated subsection in the evaluation setup that reports these elements (including exact sample counts per benchmark, the static analysis tools employed, the size and protocol of any manual verification, and any available reliability statistics). This addition will directly address the possibility of label noise affecting the observed calibration differences. revision: yes

  2. Referee: [Repair Experiments] The claim that calibration-guided automated repair yields "only limited improvements" while "frequently introducing functional regressions" requires quantitative support on the magnitude of regressions and the baseline repair success rate. The abstract does not report effect sizes, statistical significance, or controls for temperature and model, making it impossible to judge whether the limited benefit is robust.

    Authors: We accept that the abstract summarizes the repair outcomes at a high level without the requested quantitative details. The full results section already breaks down repair success and regression rates by model and temperature, but does not include effect sizes or formal significance tests. In the revision we will augment the repair experiment subsection with effect-size calculations, statistical significance results, and explicit confirmation that all comparisons control for temperature and model. These additions will provide the quantitative support needed to evaluate the robustness of the "limited improvements" and "functional regressions" findings. revision: yes

Circularity Check

0 steps flagged

Empirical measurement study with no derivations or self-referential reductions

full rationale

The paper is a large-scale empirical evaluation of calibration in three LLMs across two benchmarks, reporting observed overconfidence rates, security-vs-functional differences, and mitigation outcomes. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented as deriving new results; all claims rest on direct experimental measurements. The central findings (prevalent overconfidence; security calibration better than functional) are statistical summaries of the collected data rather than reductions to prior self-citations or input definitions. Benchmark labeling assumptions are external validity concerns, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the chosen benchmarks validly capture security and functional properties; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption The selected benchmarks accurately reflect real-world security vulnerabilities and functional correctness.
    The study interprets calibration differences between self-contained tasks and repository-level contexts as meaningful only if the benchmarks are representative.

pith-pipeline@v0.9.1-grok · 5755 in / 1108 out tokens · 35500 ms · 2026-07-01T05:01:43.214142+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 19 canonical work pages · 7 internal anchors

  1. [1]

    Security in the age of ai teammates: An empirical study of agentic pull requests on github,

    M. L. Siddiq, X. Zhao, V . C. Lopes, B. Casey, and J. C. S. Santos, “Security in the age of ai teammates: An empirical study of agentic pull requests on github,” 2026, under-review in Information and Software Technology

  2. [2]

    An empirical study of code smells in transformer-based code generation techniques,

    M. L. Siddiq, S. H. Majumder, M. R. Mim, S. Jajodia, and J. C. S. Santos, “An empirical study of code smells in transformer-based code generation techniques,” in2022 IEEE 22nd International Working Con- ference on Source Code Analysis and Manipulation (SCAM), 2022, pp. 71–82

  3. [3]

    Asleep at the keyboard? assessing the security of github copilot’s code contributions,

    H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of github copilot’s code contributions,” vol. 68, no. 2, Jan. 2025, p. 96–105. [Online]. Available: https://doi.org/10.1145/3610721

  4. [4]

    Do users write more insecure code with ai assistants?

    N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with ai assistants?” inProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’23, 2023, p. 2785–2799. [Online]. Available: https://doi.org/10.1145/3576915.3623157

  5. [5]

    Lost at c: A user study on the security implications of large language model code assistants,

    G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, “Lost at c: A user study on the security implications of large language model code assistants,” in32nd USENIX Security Symposium (USENIX Security 23), 2023, pp. 2205–2222

  6. [6]

    Sallm: Security assessment of generated code,

    M. L. Siddiq, J. C. da Silva Santos, S. Devareddy, and A. Muller, “Sallm: Security assessment of generated code,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, ser. ASEW ’24, 2024, p. 54–65. [Online]. Available: https://doi.org/10.1145/3691621.3694934

  7. [7]

    Truthfulqa: Measuring how models mimic human falsehoods,

    S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” inProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), 2022, pp. 3214–3252

  8. [8]

    64 Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, and Mohammad Masudur Rahman

    C. Spiess, D. Gros, K. S. Pai, M. Pradel, M. R. I. Rabin, A. Alipour, S. Jha, P. Devanbu, and T. Ahmed, “Calibration and correctness of language models for code,” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25, 2025, p. 540–552. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00040

  9. [9]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

  10. [10]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program synthesis with large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2108.07732

  11. [11]

    SWE-bench: Can language models resolve real- world github issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real- world github issues?” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https: //openreview.net/forum?id=VTF8yNQM66

  12. [12]

    A.s.e: A repository-level benchmark for evaluating security in ai-generated code,

    K. Lian, B. Wang, L. Zhang, L. Chen, J. Wang, Z. Zhao, Y . Yang, M. Lin, H. Duan, H. Zhaoet al., “A.s.e: A repository-level benchmark for evaluating security in ai-generated code,” 2025. [Online]. Available: https://arxiv.org/abs/2508.18106

  13. [13]

    LLMs cannot reliably identify and reason about security vulnera- bilities (yet?): A comprehensive evaluation, framework, and bench- marks,

    S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, and G. Stringhini, “LLMs cannot reliably identify and reason about security vulnera- bilities (yet?): A comprehensive evaluation, framework, and bench- marks,” inIEEE Symposium on Security and Privacy (SP), 2024, arXiv:2312.12575

  14. [14]

    On the definition of appropriate trust and the tools that come with it,

    H. Löfström, “On the definition of appropriate trust and the tools that come with it,” in2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE). IEEE, 2023, pp. 1555– 1562

  15. [15]

    Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models,

    M. Bhatt, S. Chennabasappa, Y . Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y . Chen, D. Kapilet al., “Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models,” Tech. Rep., 2024

  16. [16]

    SafeGenBench: A benchmark framework for security vulnerability detection in LLM-generated code,

    X. Li, J. Ding, C. Peng, B. Zhao, X. Gao, H. Gao, and X. Gu, “SafeGenBench: A benchmark framework for security vulnerability detection in LLM-generated code,”arXiv preprint arXiv:2506.05692, 2025

  17. [17]

    Obtaining well calibrated probabilities using bayesian binning,

    M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using bayesian binning,” inProceedings of the AAAI conference on artificial intelligence, vol. 29, no. 1, 2015

  18. [18]

    Verification of forecasts expressed in terms of probability,

    W. B. Glennet al., “Verification of forecasts expressed in terms of probability,”Monthly weather review, vol. 78, no. 1, pp. 1–3, 1950

  19. [19]

    GPT-4 Technical Report

    OpenAI, “GPT-4 technical report,” OpenAI, Tech. Rep., 2024. [Online]. Available: https://arxiv.org/abs/2303.08774

  20. [20]

    doi: 10.1126/science.abq1158

    Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code gene...

  21. [21]

    How Secure is Code Generated by ChatGPT?

    R. Khoury, A. R. Avila, J. Brunelle, B. M. Coutureet al., “How secure is code generated by chatgpt?”arXiv preprint arXiv:2304.09655, 2023

  22. [22]

    Securityeval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques,

    M. L. Siddiq and J. C. S. Santos, “Securityeval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques,” inProceedings of the 1st International Workshop on Mining Software Repositories Applications for Privacy and Security, ser. MSR4P&S 2022, 2022, p. 29–33. [Online]. Available: https://doi.org/10.1145/3549035.3561184

  23. [23]

    CWE Top 25 Most Dangerous Software Weaknesses,

    MITRE Corporation, “CWE Top 25 Most Dangerous Software Weaknesses,” 2023. [Online]. Available: https://cwe.mitre.org/top25/

  24. [24]

    Understanding software vulnerabilities related to archi- tectural security tactics: An empirical investigation of chromium, php and thunderbird,

    J. C. Santos, A. Peruma, M. Mirakhorli, M. Galstery, J. V . Vidal, and A. Sejfia, “Understanding software vulnerabilities related to archi- tectural security tactics: An empirical investigation of chromium, php and thunderbird,” in2017 IEEE International Conference on Software Architecture (ICSA). IEEE, 2017, pp. 69–78

  25. [25]

    Codelm- sec benchmark: Systematically evaluating and finding security vulnera- bilities in black-box code language models,

    H. Hajipour, K. Hassler, T. Holz, L. Schönherr, and M. Fritz, “Codelm- sec benchmark: Systematically evaluating and finding security vulnera- bilities in black-box code language models,” inSecond IEEE Conference on Secure and Trustworthy Machine Learning, 2024

  26. [26]

    Re(Gex|DoS)eval: Evaluating generated regular expressions and their proneness to dos attacks,

    M. L. Siddiq, J. Zhang, L. Roney, and J. C. Santos, “Re(Gex|DoS)eval: Evaluating generated regular expressions and their proneness to dos attacks,” inProceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, 2024, pp. 52–56

  27. [27]

    How can we know when language models know? on the calibration of language models for question answering,

    Z. Jiang, J. Araki, H. Ding, and G. Neubig, “How can we know when language models know? on the calibration of language models for question answering,”Transactions of the Association for Computational Linguistics, vol. 9, pp. 962–977, 2021. [Online]. Available: https://aclanthology.org/2021.tacl-1.57/

  28. [28]

    On calibration of modern neural networks,

    C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” inInternational conference on machine learning. PMLR, 2017, pp. 1321–1330

  29. [29]

    Mea- suring calibration in deep learning,

    J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran, “Mea- suring calibration in deep learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Work- shops, June 2019

  30. [30]

    Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,

    J. Plattet al., “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Tech. Rep. 3, 1999

  31. [31]

    Predicting good probabilities with supervised learning,

    A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” inProceedings of the 22nd international conference on Machine learning, 2005, pp. 625–632

  32. [32]

    Language Models (Mostly) Know What They Know

    S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnsonet al., “Language models (mostly) know what they know,”arXiv preprint arXiv:2207.05221, 2022

  33. [33]

    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

    L. Kuhn, Y . Gal, and S. Farquhar, “Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,” arXiv preprint arXiv:2302.09664, 2023

  34. [34]

    Teaching models to express their uncertainty in words,

    S. Lin, J. Hilton, and O. Evans, “Teaching models to express their uncertainty in words,”Transactions on Machine Learning Research, 2022. [Online]. Available: https://openreview.net/forum?id= 8s8K2UZGTZ

  35. [35]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback,

    K. Tian, E. Mitchell, A. Zhou, A. Sharma, R. Rafailov, H. Yao, C. Finn, and C. Manning, “Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds., Dec....

  36. [36]

    Self-consistency improves chain of thought reasoning in language models,

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdh- ery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” 2022

  37. [37]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22, 2022

  38. [38]

    Calibration of pre-trained transformers,

    S. Desai and G. Durrett, “Calibration of pre-trained transformers,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y . He, and Y . Liu, Eds., Nov. 2020, pp. 295–302. [Online]. Available: https://aclanthology.org/2020.emnlp-main.21/

  39. [39]

    Toward trustworthy neural program synthesis,

    D. Key, W.-D. Li, and K. Ellis, “Toward trustworthy neural program synthesis,” 2023. [Online]. Available: https://arxiv.org/abs/2210.00848

  40. [40]

    On calibration of pre-trained code models,

    Z. Zhou, C. Sha, and X. Peng, “On calibration of pre-trained code models,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24, 2024. [Online]. Available: https://doi.org/10.1145/3597503.3639126

  41. [41]

    Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

    J. Liu, C. S. Xia, Y . Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,”arXiv preprint arXiv:2305.01210, 2023

  42. [42]

    GPT-4o-mini Technical Overview,

    OpenAI, “GPT-4o-mini Technical Overview,” https://openai.com, 2024, accessed February 2026

  43. [43]

    Gemini Technical Report,

    Google DeepMind, “Gemini Technical Report,” https://ai.google.dev, 2024, gemini-2.0-Flash API, accessed February 2026

  44. [44]

    Qwen3-coder-next technical report,

    Qwen Team, “Qwen3-coder-next technical report,” Tech. Rep., accessed: 2026-02-03. [Online]. Available: https://github.com/QwenLM/ Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf

  45. [45]

    Breaking the silence: the threats of using llms in software engineering,

    J. Sallou, T. Durieux, and A. Panichella, “Breaking the silence: the threats of using llms in software engineering,” inProceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE-NIER’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 102–106. [Online]. Available: htt...

  46. [46]

    Classifier calibration with roc- regularized isotonic regression,

    E. Berta, F. Bach, and M. Jordan, “Classifier calibration with roc- regularized isotonic regression,” inInternational Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 1972–1980