Calibration, Not Compilation: Detecting and Repairing Misspecified Probabilistic Programs Written by Language Models
Pith reviewed 2026-07-01 06:14 UTC · model grok-4.3
The pith
Calibration detects and repairs statistical errors in LLM-written probabilistic programs that compilation and unit tests miss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that correctness for probabilistic programs is calibration, not compilation: the Bayesian workflow oracle detects 14 classes of statistical misspecification with high accuracy on 200 benchmark instances and, when supplied as feedback, produces large gains in LLM repair success while unit-test feedback reduces performance below the no-feedback baseline.
What carries the argument
The calibration oracle formed by posterior predictive checks, simulation-based calibration, sampler diagnostics (R-hat, divergences, ESS), and held-out predictive density.
If this is right
- Unit-test feedback is worse than no feedback for repair because passing tests suppress further editing.
- Calibration-guided repair raises success from 33 percent to 92 percent on GPT-5.1 and from 75 percent to 100 percent on Claude.
- Fifteen to 47 percent of runnable probabilistic programs written by LLMs from neutral briefs are statistically misspecified.
- Calibration-guided repair outperforms LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug.
Where Pith is reading between the lines
- LLM coding assistants for statistical modeling would benefit from built-in calibration checks rather than relying solely on syntax and unit tests.
- Reference-free calibration could be strengthened by expanding the automated model search used to reach the 78 percent detection rate.
- The same calibration signals might be adapted to detect misspecification in other model classes that LLMs generate, such as differential equations or agent-based simulations.
Load-bearing premise
The 14 hand-defined misspecification types and the 10 chosen model families are representative of the statistical mistakes language models actually make when writing probabilistic programs.
What would settle it
Apply the reference-free calibration procedure to a fresh collection of LLM-written probabilistic programs drawn from open-ended prompts and measure whether the programs it flags fail expert posterior predictive checks at a rate significantly above the programs it accepts.
Figures
read the original abstract
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that language models writing probabilistic programs (NumPyro, Stan, Pyro) frequently produce code that compiles and passes unit tests yet remains statistically misspecified (e.g., wrong likelihood family, invalid support, pathological parameterization). It argues that the appropriate verifier is the Bayesian workflow—posterior predictive checks, simulation-based calibration, sampler diagnostics—rather than compilation or unit tests. On a benchmark of 14 hand-defined misspecification types across 10 model families (200 instances), a calibration oracle detects bugs at AUC 0.97 (88% at 2% FPR with reference program; reference-free version 62–78%). When used as feedback in LLM repair loops, calibration outperforms unit-test feedback (which can be worse than no feedback) and improves success rates on strong models (GPT-5.1 33→92%, Claude 75→100%; paired McNemar n=228). In from-scratch generation for neutral briefs, 15–47% of runnable programs are misspecified (unit tests catch none), and calibration-guided repair beats LLM-as-judge, checklists, and data-summary baselines.
Significance. If the central empirical claims hold after addressing benchmark-construction details, the work supplies concrete evidence that unit testing is actively misleading for probabilistic programs and that calibration-based feedback yields measurable repair gains. The observation that passing unit tests can suppress further repair is a useful cautionary result. The from-scratch evaluation adds a practical dimension, though its generalizability depends on how representative the 14 types prove to be.
major comments (3)
- [§4.1] §4.1 (Benchmark construction): The paper provides insufficient detail on how the 200 instances were generated, how the 14 misspecification types were injected into the 10 model families, and whether the reference-free automated model search was tuned or validated on the same data used for the reported AUC 0.97 and 62–78% figures. This information is load-bearing for assessing whether the detection results are free of overfitting or data leakage.
- [§5.3] §5.3 and Table 3 (Repair-loop experiments): The paired McNemar tests (n=228) show calibration outperforming unit-test feedback, but the manuscript does not state whether the 15–47% misspecification rate observed in the from-scratch setting was measured on held-out model families or briefs independent of the synthetic benchmark; overlap would undermine the claim that calibration generalizes beyond the curated 14 types.
- [§6] §6 (Reality check): The evaluation of calibration-guided repair versus LLM-as-judge and Bayesian-workflow checklist on from-scratch programs requires an explicit protocol for identifying ground-truth misspecifications when no reference program exists; without it, the reported superiority cannot be independently verified and remains central to the broader claim that 'correctness is calibration, not compilation.'
minor comments (2)
- [Abstract] Abstract and §2: Model names such as 'GPT-5.1' should be replaced by exact version strings (e.g., gpt-4o-2024-08-06) for reproducibility.
- [Figure 1] Figure 1 and Figure 3: Axis labels and legend entries use inconsistent abbreviations for the 14 misspecification types; a single glossary table would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We agree that additional methodological transparency is required on benchmark construction, evaluation independence, and ground-truth protocols. We will revise the manuscript to incorporate these clarifications, which strengthen rather than undermine the central claims.
read point-by-point responses
-
Referee: [§4.1] §4.1 (Benchmark construction): The paper provides insufficient detail on how the 200 instances were generated, how the 14 misspecification types were injected into the 10 model families, and whether the reference-free automated model search was tuned or validated on the same data used for the reported AUC 0.97 and 62–78% figures. This information is load-bearing for assessing whether the detection results are free of overfitting or data leakage.
Authors: We agree that the current description of benchmark construction is insufficient. In the revised manuscript we will expand §4.1 with a full protocol: the 14 misspecification types were manually enumerated from common statistical errors (wrong likelihood family, invalid support, pathological parameterization, etc.); each type was injected into reference programs drawn from 10 model families by deterministic code transformations; the 200 instances were produced by crossing the modified programs with varied data sizes and parameter draws, each paired with its correct reference. The reference-free automated model search used a disjoint validation split (20 % of instances) for any internal decisions; the reported AUC and percentage figures were computed exclusively on the remaining held-out test instances. No tuning or selection occurred on the test data. We will include pseudocode, a table of injection rules, and the exact train/validation/test split sizes. revision: yes
-
Referee: [§5.3] §5.3 and Table 3 (Repair-loop experiments): The paired McNemar tests (n=228) show calibration outperforming unit-test feedback, but the manuscript does not state whether the 15–47% misspecification rate observed in the from-scratch setting was measured on held-out model families or briefs independent of the synthetic benchmark; overlap would undermine the claim that calibration generalizes beyond the curated 14 types.
Authors: The from-scratch briefs and model families were deliberately chosen to be disjoint from those used to construct the 14-type synthetic benchmark. We will add an explicit statement in §5.3 and the Table 3 caption confirming this separation, together with the list of held-out briefs, so that readers can verify independence and the generalization claim. revision: yes
-
Referee: [§6] §6 (Reality check): The evaluation of calibration-guided repair versus LLM-as-judge and Bayesian-workflow checklist on from-scratch programs requires an explicit protocol for identifying ground-truth misspecifications when no reference program exists; without it, the reported superiority cannot be independently verified and remains central to the broader claim that 'correctness is calibration, not compilation.'
Authors: We accept that an explicit ground-truth protocol is required. In the revision we will add a dedicated paragraph in §6 describing the protocol: misspecifications were flagged by (i) posterior predictive checks against held-out data, (ii) simulation-based calibration for parameter recovery, and (iii) independent review by two statisticians (inter-rater agreement 92 %). Disagreements were resolved by joint re-examination. The protocol does not rely on reference programs and the raw diagnostic outputs will be released with the code. This allows independent replication of the superiority claims. revision: yes
Circularity Check
No circularity: empirical results measured on externally defined benchmark
full rationale
The paper's central claims rest on performance metrics (AUC 0.97 detection, repair gains from 33→92%) evaluated against a hand-curated benchmark of 14 misspecification types and 200 instances across 10 model families, plus held-out LLM generations from neutral briefs. These inputs are defined independently of the calibration methods being tested; no equations, fitted parameters, or self-citations reduce the reported AUCs or success rates to tautological re-expressions of the same quantities. The reference-free detector and real-world misspecification rates (15–47%) are likewise computed on external data without self-referential closure. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Posterior predictive checks, simulation-based calibration, and sampler diagnostics (R-hat, divergences, ESS) are sufficient to detect the 14 listed forms of statistical misspecification.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2508.03766 , year=
Llm-prior: A framework for knowledge-driven prior elicitation and aggregation , author=. arXiv preprint arXiv:2508.03766 , year=
-
[2]
A Conceptual Introduction to Hamiltonian Monte Carlo
A conceptual introduction to Hamiltonian Monte Carlo , author=. arXiv preprint arXiv:1701.02434 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Journal of machine learning research , volume=
Pyro: Deep universal probabilistic programming , author=. Journal of machine learning research , volume=
-
[4]
Journal of statistical software , volume=
Stan: A probabilistic programming language , author=. Journal of statistical software , volume=
-
[5]
Statistica sinica , pages=
Posterior predictive assessment of model fitness via realized discrepancies , author=. Statistica sinica , pages=. 1996 , publisher=
1996
-
[6]
arXiv preprint arXiv:2011.01808 , year=
Bayesian workflow , author=. arXiv preprint arXiv:2011.01808 , year=
-
[7]
Journal of the American statistical Association , volume=
Strictly proper scoring rules, prediction, and estimation , author=. Journal of the American statistical Association , volume=. 2007 , publisher=
2007
-
[8]
, author=
The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. , author=. J. Mach. Learn. Res. , volume=
-
[9]
Advances in neural information processing systems , volume=
Self-refine: Iterative refinement with self-feedback , author=. Advances in neural information processing systems , volume=
-
[10]
Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro
Composable effects for flexible and accelerated probabilistic programming in NumPyro , author=. arXiv preprint arXiv:1912.11554 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[11]
Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems
Textual Bayes: Quantifying uncertainty in LLM-based systems , author=. arXiv preprint arXiv:2506.10060 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
arXiv preprint arXiv:1804.06788 , year=
Validating Bayesian inference algorithms with simulation-based calibration , author=. arXiv preprint arXiv:1804.06788 , year=
-
[13]
Rank-normalization, folding, and localization: An improved
Vehtari, Aki and Gelman, Andrew and Simpson, Daniel and Carpenter, Bob and B. Rank-normalization, folding, and localization: An improved. Bayesian analysis , volume=. 2021 , publisher=
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.