The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
Pith reviewed 2026-05-18 22:05 UTC · model grok-4.3
The pith
LLMs show lower perplexity when completing code in strongly-typed languages than in dynamically-typed ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Strongly-typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Shell appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM; under a fixed model, relative language-level rankings are moderately stable across evaluation corpora. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence.
What carries the argument
Code perplexity, an intrinsic metric that quantifies an LLM's uncertainty when predicting successive tokens in a code sequence.
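As a minimal sketch of the metric (not the paper's exact procedure, which the abstract does not specify), perplexity can be computed as the exponential of the mean negative log-likelihood per token. The function name `perplexity` and the input `token_logprobs` are illustrative assumptions; how the log-probabilities are obtained depends on the model and tokenizer.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) over the observed tokens.

    token_logprobs: natural-log probabilities the model assigned to each
    token that actually occurred (hypothetical input for illustration).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Synthetic inputs: a model giving every token p = 0.5 versus p = 0.9.
uncertain = [math.log(0.5)] * 8
confident = [math.log(0.9)] * 8

print(perplexity(uncertain))
print(perplexity(confident))
```

Lower values mean the model spread less probability mass over alternatives, which is the sense in which the review speaks of "confidence".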
If this is right
- LLM code completion may be more dependable in projects that use strongly-typed languages.
- Shell scripts and dynamic languages may carry higher risk of uncertain or incorrect LLM outputs.
- Switching models changes absolute confidence levels but preserves most language rankings.
- Code comments raise perplexity without rearranging which languages rank as easier or harder.
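One way to check whether language rankings survive a change of model is a Spearman rank correlation between per-language perplexities under two models. The sketch below is an assumption about how such a check might look, not the paper's analysis; the perplexity values are synthetic placeholders, and the simple formula used here assumes no tied values.

```python
def ranks(values):
    """Assign ranks 1..n by ascending value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length, tie-free sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Synthetic per-language perplexities (NOT from the paper).
languages = ["Java", "Go", "Python", "Shell"]
model_a = [1.8, 2.1, 2.6, 3.9]
model_b = [2.4, 2.2, 3.1, 4.5]

rho = spearman(model_a, model_b)
print(rho)
```

A value near 1 would indicate that the two models order the languages almost identically even if their absolute perplexities differ.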
Where Pith is reading between the lines
- Teams could favor strongly-typed languages when heavy reliance on LLM assistants is planned.
- IDE tools might surface perplexity scores to let developers gauge how much to trust a suggestion.
- Similar measurements using other uncertainty metrics, such as entropy, could refine these language comparisons.
Load-bearing premise
Intrinsic metrics such as perplexity can serve as proxies for functional correctness and hallucination risk in LLM-generated code.
What would settle it
A direct measurement showing that lower-perplexity code completions are no more likely to pass functional tests or avoid hallucinations than higher-perplexity ones.
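A rough sketch of such a measurement, under stated assumptions: each completion has a perplexity score and a pass/fail outcome from functional tests, and the pass rates of the lower- and higher-perplexity halves are compared. The function `pass_rate_by_perplexity` and the records are hypothetical, and a real study would also need a significance test.

```python
from statistics import median


def pass_rate(outcomes):
    """Fraction of True values; NaN for an empty group."""
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")


def pass_rate_by_perplexity(records):
    """Split (perplexity, passed) pairs at the median perplexity.

    Returns (pass rate at or below the median, pass rate above it).
    """
    cut = median(p for p, _ in records)
    low = [passed for p, passed in records if p <= cut]
    high = [passed for p, passed in records if p > cut]
    return pass_rate(low), pass_rate(high)


# Synthetic records (NOT from the paper): (perplexity, tests passed?).
records = [
    (1.2, True), (1.5, True), (1.9, False),
    (2.8, True), (3.5, False), (4.1, False),
]

low_rate, high_rate = pass_rate_by_perplexity(records)
print(low_rate, high_rate)
```

If the two rates were statistically indistinguishable on real data, the proxy premise would be undermined.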
Original abstract
Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when generating code by measuring code perplexity across programming languages, models, and datasets using various LLMs, and a sample of 2254 files from 881 GitHub projects. We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Shell appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM; under a fixed model, relative language-level rankings are moderately stable across evaluation corpora. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence. LLM researchers, developers, and users can employ our findings to assess the benefits and suitability of LLM-based code completion in specific software projects based on how language, model choice, and code characteristics impact model confidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical evaluation of LLM confidence in code completion via perplexity measurements across programming languages, models, and datasets. Using a sample of 2254 files from 881 GitHub projects, it finds that strongly-typed languages show lower perplexity than dynamically-typed languages, scripting languages exhibit higher perplexity, Shell has universally high perplexity while Java is low, and that perplexity depends on the specific LLM but relative language rankings remain moderately stable across evaluation corpora. Code comments tend to increase perplexity without substantially altering language rankings. The authors position these intrinsic metrics as proxies for functional correctness and hallucination risk to guide practical use of LLM-based code completion.
Significance. If the reported perplexity rankings hold, the work offers descriptive insights into how language typing, scripting nature, and model choice influence LLM uncertainty on real GitHub code. The observed stability of relative rankings across models and corpora is a clear strength that could inform model-agnostic guidance. However, the absence of any validation linking lower perplexity to improved functional correctness or reduced hallucinations limits the significance of the practical recommendations for developers and researchers.
major comments (2)
- [Abstract] The claim that intrinsic metrics such as perplexity 'can serve as proxies for functional correctness and hallucination risk in LLM-generated code' is presented as motivation but receives no supporting analysis (e.g., correlation with pass@1 rates, hallucination counts, or downstream task performance) on the 2254-file corpus. This assumption is load-bearing for the stated utility of the language and model rankings.
- [Abstract] Abstract and implied Methods: no details are provided on the exact perplexity computation (e.g., tokenization handling, context window usage, or normalization), statistical significance tests, or controls for confounders such as file size and tokenization differences across languages. These omissions affect interpretability of the reported language-level differences.
minor comments (1)
- The manuscript would benefit from explicit definitions or references for how perplexity, entropy, and mutual information are computed in the code-completion setting.
Simulated Author's Rebuttal
We thank the referee for their valuable comments. Below we provide a point-by-point response to the major comments and outline the changes we intend to implement in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that intrinsic metrics such as perplexity 'can serve as proxies for functional correctness and hallucination risk in LLM-generated code' is presented as motivation but receives no supporting analysis (e.g., correlation with pass@1 rates, hallucination counts, or downstream task performance) on the 2254-file corpus. This assumption is load-bearing for the stated utility of the language and model rankings.
Authors: We recognize that the manuscript does not provide empirical validation linking perplexity to functional correctness or hallucination rates on the evaluated corpus. The statement in the abstract is intended as a general motivation drawn from the broader literature on LLM uncertainty estimation. To address this concern, we will revise the abstract to present this as a potential application rather than an established fact, and we will add a discussion of this limitation in the paper. This revision will clarify the scope of our contributions, which focus on characterizing perplexity differences across languages and models.
Revision: yes
- Referee: [Abstract] Abstract and implied Methods: no details are provided on the exact perplexity computation (e.g., tokenization handling, context window usage, or normalization), statistical significance tests, or controls for confounders such as file size and tokenization differences across languages. These omissions affect interpretability of the reported language-level differences.
Authors: We agree that the current manuscript lacks sufficient detail on the perplexity calculation procedure and related methodological aspects. In the revised version, we will provide a detailed description of the perplexity computation, including specifics on tokenization, context handling, and normalization. Additionally, we will incorporate statistical tests to assess the significance of the observed differences and perform analyses to control for potential confounding factors such as file length and tokenization variations across languages.
Revision: yes
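To illustrate why tokenization is a plausible confounder, the sketch below normalizes total negative log-likelihood by UTF-8 byte count (bits per byte) instead of by token count. This is one commonly used tokenizer-independent normalization, offered here as an assumption about what a control could look like, not as the authors' planned method. The function `bits_per_byte` and the log-probabilities are hypothetical.

```python
import math


def bits_per_byte(token_logprobs, text):
    """Total negative log-likelihood in bits, divided by UTF-8 byte length."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    total_nll_nats = -sum(token_logprobs)
    return total_nll_nats / (math.log(2) * n_bytes)


def per_token_perplexity(token_logprobs):
    """exp(mean negative log-likelihood per token), for comparison."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


snippet = "x = 1\n"  # 6 bytes

# Synthetic log-probs for the same snippet under two hypothetical tokenizers:
# a coarse one (2 tokens) and a fine one (4 tokens), same total likelihood.
coarse = [math.log(0.25)] * 2
fine = [math.log(0.5)] * 4

print(per_token_perplexity(coarse), per_token_perplexity(fine))
print(bits_per_byte(coarse, snippet), bits_per_byte(fine, snippet))
```

In this synthetic case the per-token perplexities differ by a factor of two while the per-byte figure is identical, which shows how segmentation alone can shift a per-token metric between languages.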
Circularity Check
No circularity: direct empirical measurements of perplexity
Full rationale
The paper conducts an empirical study computing perplexity on a fixed external sample of 2254 GitHub files across languages and models. No derivations, equations, or predictions are presented that reduce to fitted parameters or self-referential definitions. Language rankings and stability observations are direct statistical summaries of the measured values. The proxy role of perplexity for correctness is stated as an initial motivation without any internal derivation or self-citation chain that would make the reported results tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Perplexity, entropy, and mutual information serve as proxies for functional correctness and hallucination risk in LLM-generated code.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "intrinsic metrics such as perplexity, entropy, and mutual information ... can serve as proxies for functional correctness and hallucination risk in LLM-generated code"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.