The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion

Diomidis Spinellis; Konstantina Dritsa; Panos Louridas; Zoe Kotti

arxiv: 2508.16131 · v2 · submitted 2025-08-22 · 💻 cs.SE · cs.AI

Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas

Pith reviewed 2026-05-18 22:05 UTC · model grok-4.3

keywords LLM confidence · code completion · perplexity · programming languages · strongly-typed · dynamically-typed · hallucination risk · model uncertainty

The pith

LLMs show lower perplexity when completing code in strongly-typed languages than in dynamically-typed ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures LLM confidence during code completion by tracking perplexity across languages, models, and real GitHub files. It reports that strongly-typed languages produce lower perplexity scores than dynamically-typed and scripting languages, with Shell consistently high and Java low. For any fixed model the relative ordering of languages stays fairly stable across different code collections, and comments do not change that ordering much. The work positions these intrinsic measures as practical signals for deciding when LLM code completion is likely to be more reliable in a given project.

Core claim

Strongly-typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Shell appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM; under a fixed model, relative language-level rankings are moderately stable across evaluation corpora. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence.

What carries the argument

Code perplexity, an intrinsic metric that quantifies an LLM's uncertainty when predicting successive tokens in a code sequence.
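
As a minimal sketch of the metric described above, perplexity can be computed as the exponential of the mean negative log-likelihood of the tokens in a sequence. The function name `perplexity` and the sample log-probabilities are illustrative assumptions; the paper's exact computation (tokenization, context window, normalization) is not specified here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities, one per predicted token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical values for illustration only; not measurements from the paper.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
uncertain = [math.log(0.2), math.log(0.1), math.log(0.3)]

print(perplexity(confident), perplexity(uncertain))
```

A model that assigns probability 0.5 to every token has perplexity 2, so lower values mean the model found the observed tokens less surprising.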

If this is right

LLM code completion may be more dependable in projects that use strongly-typed languages.
Shell scripts and dynamic languages may carry higher risk of uncertain or incorrect LLM outputs.
Perplexity levels depend on the model; with the model held fixed, language rankings stay moderately stable across evaluation corpora.
Code comments raise perplexity without rearranging which languages rank as easier or harder.
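
One way to quantify how stable a language ranking is across two corpora is a rank correlation. The sketch below assumes Spearman's rho as the stability measure; the statistic the paper actually uses is not stated here. The helper names and the per-language values are made up for illustration.

```python
def ranks(values):
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder mean perplexities for four unnamed languages on two corpora.
corpus_a = [1.2, 1.5, 2.0, 3.1]
corpus_b = [1.3, 1.4, 2.6, 2.2]

print(spearman(corpus_a, corpus_b))
```

A value near 1 indicates that both corpora order the languages almost identically, even if the absolute perplexities differ.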

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could favor strongly-typed languages when heavy reliance on LLM assistants is planned.
IDE tools might surface perplexity scores to let developers gauge how much to trust a suggestion.
Repeating the measurements with other uncertainty metrics, such as entropy, could refine these language comparisons.
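
For the entropy suggestion above, a rough sketch: perplexity depends only on the probability given to the token that actually occurred, whereas entropy summarizes the whole next-token distribution. The function name and the example distributions below are assumptions for illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of one next-token probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over a four-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]

print(entropy_bits(peaked), entropy_bits(flat))
```

A flat distribution over four tokens gives 2 bits, the maximum for that vocabulary size, while a peaked distribution gives a value close to zero.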

Load-bearing premise

Intrinsic metrics such as perplexity can serve as proxies for functional correctness and hallucination risk in LLM-generated code.

What would settle it

A direct measurement showing that lower-perplexity code completions are no more likely to pass functional tests or avoid hallucinations than higher-perplexity ones.
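
A minimal sketch of such a check, assuming each completion comes with a perplexity score and a pass/fail outcome from functional tests: split the completions at the median perplexity and compare pass rates. The function name and the sample records are hypothetical, and a real study would also need a significance test and controls.

```python
from statistics import median

def pass_rate_by_perplexity(samples):
    """Split (perplexity, passed) pairs at the median perplexity.

    Returns (pass rate of the low-perplexity half,
             pass rate of the high-perplexity half).
    """
    cut = median(p for p, _ in samples)
    low = [passed for p, passed in samples if p <= cut]
    high = [passed for p, passed in samples if p > cut]
    if not low or not high:
        raise ValueError("need samples on both sides of the median")
    return sum(low) / len(low), sum(high) / len(high)

# Invented records for illustration; not results from the paper.
samples = [(1.2, True), (1.5, True), (1.9, False),
           (2.8, True), (3.5, False), (4.1, False)]

low_rate, high_rate = pass_rate_by_perplexity(samples)
print(low_rate, high_rate)
```

If the two rates were indistinguishable on real data, the proxy premise would be in trouble; a clear gap in favor of the low-perplexity half would support it.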

Original abstract

Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when generating code by measuring code perplexity across programming languages, models, and datasets using various LLMs, and a sample of 2254 files from 881 GitHub projects. We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Shell appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM; under a fixed model, relative language-level rankings are moderately stable across evaluation corpora. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence. LLM researchers, developers, and users can employ our findings to assess the benefits and suitability of LLM-based code completion in specific software projects based on how language, model choice, and code characteristics impact model confidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper gives fresh comparative perplexity numbers across languages on real GitHub code but leaves the core proxy claim untested.

The key takeaway is that this work measures perplexity on 2254 files from 881 GitHub projects and reports lower values for strongly-typed languages than for dynamic or scripting ones, with Java consistently low and Shell high. With the model held fixed, rankings stay fairly stable across different corpora, and comments raise perplexity without flipping the language order. That is the concrete addition: new empirical rankings rather than a new method or framework. The authors also show that perplexity depends on the specific LLM, which is worth knowing for anyone picking a model for code completion. The sample size and the checks on corpus stability and comments are the parts that feel solid and worth having in the record.

The main limitation is that the motivation and the practical advice rest on the idea that lower perplexity signals better functional correctness or lower hallucination risk, yet the paper does not test that link. There is no correlation with pass rates, no comparison to actual generated code quality, and no regression that controls for file size or tokenization. Without that step the guidance on language and model choice stays suggestive rather than demonstrated. The citation pattern looks standard for the area and does not lean on self-reference in a problematic way.

This is the kind of paper that belongs in a reading group focused on LLM evaluation for software engineering. Readers who want data points on how current models behave across languages will find it useful even if they treat the proxy claim as provisional. It is coherent on its own terms and deserves a serious referee who can check the exact perplexity computation and ask for at least a small validation against a downstream metric. I would send it to review after a revision that either adds that check or narrows the utility claims.

Referee Report

2 major / 1 minor

Summary. The manuscript reports an empirical evaluation of LLM confidence in code completion via perplexity measurements across programming languages, models, and datasets. Using a sample of 2254 files from 881 GitHub projects, it finds that strongly-typed languages show lower perplexity than dynamically-typed languages, scripting languages exhibit higher perplexity, Shell has universally high perplexity while Java is low, and that perplexity depends on the specific LLM but relative language rankings remain moderately stable across evaluation corpora. Code comments tend to increase perplexity without substantially altering language rankings. The authors position these intrinsic metrics as proxies for functional correctness and hallucination risk to guide practical use of LLM-based code completion.

Significance. If the reported perplexity rankings hold, the work offers descriptive insights into how language typing, scripting nature, and model choice influence LLM uncertainty on real GitHub code. The observed stability of relative rankings across corpora under a fixed model is a clear strength that could inform language-level guidance. However, the absence of any validation linking lower perplexity to improved functional correctness or reduced hallucinations limits the significance of the practical recommendations for developers and researchers.

major comments (2)

[Abstract] The claim that intrinsic metrics such as perplexity 'can serve as proxies for functional correctness and hallucination risk in LLM-generated code' is presented as motivation but receives no supporting analysis (e.g., correlation with pass@1 rates, hallucination counts, or downstream task performance) on the 2254-file corpus. This assumption is load-bearing for the stated utility of the language and model rankings.
[Abstract and implied Methods] No details are provided on the exact perplexity computation (e.g., tokenization handling, context window usage, or normalization), statistical significance tests, or controls for confounders such as file size and tokenization differences across languages. These omissions affect interpretability of the reported language-level differences.
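
To illustrate the tokenization confounder raised in the second comment: per-token perplexity changes when the same text is split into more or fewer tokens, while a per-byte normalization does not. The sketch below uses bits per byte, a common normalization that is swapped in here for illustration and is not claimed to be the paper's method; the function name and the log-probabilities are assumptions.

```python
import math

def bits_per_byte(token_logprobs, source_text):
    """Total negative log-likelihood in bits, divided by UTF-8 byte count."""
    n_bytes = len(source_text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("source_text must not be empty")
    total_nats = -sum(token_logprobs)
    return total_nats / (math.log(2) * n_bytes)

text = "abcd"
# Two hypothetical tokenizations of the same 4-byte text that carry the
# same total information (4 bits) spread over different token counts.
coarse = [math.log(0.25)] * 2   # 2 tokens, 2 bits each
fine = [math.log(0.5)] * 4      # 4 tokens, 1 bit each

print(bits_per_byte(coarse, text), bits_per_byte(fine, text))
```

Per-token perplexity would be 4 for the coarse tokenization and 2 for the fine one, yet both come out at 1 bit per byte, which is why a byte-level normalization makes cross-language comparisons less sensitive to how each language tokenizes.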

minor comments (1)

The manuscript would benefit from explicit definitions or references for how perplexity, entropy, and mutual information are computed in the code-completion setting.
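
For reference, the standard textbook definitions of the three metrics are given below. They are included as a sketch of what such definitions could look like; the paper's exact variants for the code-completion setting are not stated here and may differ. The symbols are assumptions: x_{1:N} is a token sequence of length N, p is the model's predictive distribution, and X and Y are discrete random variables.

```latex
% Standard definitions; the paper's exact variants may differ.
\[
\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)
\]
\[
H(X) = -\sum_{x} p(x)\,\log p(x)
\]
\[
I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
\]
```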

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments. Below we provide a point-by-point response to the major comments and outline the changes we intend to implement in the revised manuscript.

Referee: [Abstract] The claim that intrinsic metrics such as perplexity 'can serve as proxies for functional correctness and hallucination risk in LLM-generated code' is presented as motivation but receives no supporting analysis (e.g., correlation with pass@1 rates, hallucination counts, or downstream task performance) on the 2254-file corpus. This assumption is load-bearing for the stated utility of the language and model rankings.

Authors: We recognize that the manuscript does not provide empirical validation linking perplexity to functional correctness or hallucination rates on the evaluated corpus. The statement in the abstract is intended as a general motivation drawn from the broader literature on LLM uncertainty estimation. To address this concern, we will revise the abstract to present this as a potential application rather than an established fact, and we will add a discussion of this limitation in the paper. This revision will clarify the scope of our contributions, which focus on characterizing perplexity differences across languages and models.

revision: yes

Referee: [Abstract and implied Methods] No details are provided on the exact perplexity computation (e.g., tokenization handling, context window usage, or normalization), statistical significance tests, or controls for confounders such as file size and tokenization differences across languages. These omissions affect interpretability of the reported language-level differences.

Authors: We agree that the current manuscript lacks sufficient detail on the perplexity calculation procedure and related methodological aspects. In the revised version, we will provide a detailed description of the perplexity computation, including specifics on tokenization, context handling, and normalization. Additionally, we will incorporate statistical tests to assess the significance of the observed differences and perform analyses to control for potential confounding factors such as file length and tokenization variations across languages.

revision: yes
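
As a sketch of the kind of significance test mentioned in this response, a two-sided permutation test on the difference in mean perplexity between two language groups needs no distributional assumptions. The choice of test, the function name, and the sample values are assumptions for illustration; the authors do not say which tests they will use.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the absolute difference of group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Invented per-file perplexities for two hypothetical language groups.
typed = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
scripting = [3.0, 3.1, 3.2, 3.3, 3.4, 3.5]

print(permutation_p_value(typed, scripting))
```

A small p-value suggests that the gap between the groups is unlikely to arise from random assignment of files to languages; it does not by itself rule out confounders such as file length.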

Circularity Check

0 steps flagged

No circularity: direct empirical measurements of perplexity

The paper conducts an empirical study computing perplexity on a fixed external sample of 2254 GitHub files across languages and models. No derivations, equations, or predictions are presented that reduce to fitted parameters or self-referential definitions. Language rankings and stability observations are direct statistical summaries of the measured values. The proxy role of perplexity for correctness is stated as an initial motivation without any internal derivation or self-citation chain that would make the reported results tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central findings rest on the domain assumption that perplexity reliably proxies functional correctness and hallucination risk, plus the assumption that the GitHub sample is representative of real code.

axioms (1)

Domain assumption: Perplexity, entropy, and mutual information serve as proxies for functional correctness and hallucination risk in LLM-generated code.
Explicitly stated in the abstract as motivation for using intrinsic metrics instead of downstream evaluation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

"intrinsic metrics such as perplexity, entropy, and mutual information ... can serve as proxies for functional correctness and hallucination risk in LLM-generated code"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.