
arxiv: 2606.31630 · v1 · pith:3M3KGHQEnew · submitted 2026-06-30 · 💻 cs.LG

Calibration, Not Compilation: Detecting and Repairing Misspecified Probabilistic Programs Written by Language Models

Pith reviewed 2026-07-01 06:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords probabilistic programming · language models · model misspecification · Bayesian workflow · posterior predictive checks · simulation-based calibration · program repair

The pith

Calibration detects and repairs statistical errors in LLM-written probabilistic programs that compilation and unit tests miss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models generate probabilistic programs that compile and pass unit tests yet remain statistically wrong, for example by using a Gaussian likelihood on heavy-tailed data or an invalid prior. The paper establishes that the Bayesian workflow supplies the correct verifier through posterior predictive checks, simulation-based calibration, sampler diagnostics, and predictive density. On a benchmark of 14 misspecification types this calibration oracle flags bugs with AUC 0.97 and reaches 62-78 percent even in a reference-free setting, while unit tests detect none. When the same signals are fed back into LLM repair loops, success rates rise sharply on strong models and unit-test feedback proves actively harmful by inducing false confidence. In programs written from scratch, 15-47 percent of runnable outputs are misspecified and calibration-guided repair outperforms alternative review methods.
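To make the heavy-tailed example concrete, here is a minimal sketch of a posterior predictive check, assuming a plain Gaussian model with the standard noninformative prior so that the posterior can be sampled in closed form. It is not the paper's implementation: the function names, the choice of excess kurtosis as the test statistic, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def kurtosis(x):
    """Sample excess kurtosis along the last axis."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=-1, keepdims=True)
    m2 = (centered ** 2).mean(axis=-1)
    m4 = (centered ** 4).mean(axis=-1)
    return m4 / m2 ** 2 - 3.0

def ppc_pvalue_gaussian(y, n_draws=2000, seed=0):
    """Posterior predictive p-value for excess kurtosis under a Gaussian model.

    Assumes y ~ Normal(mu, sigma) with prior p(mu, sigma^2) proportional to
    1 / sigma^2, so exact posterior draws are available without MCMC.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar, s2 = y.mean(), y.var(ddof=1)
    # sigma^2 | y follows a scaled inverse chi-square(n - 1, s2) distribution
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    # mu | sigma^2, y ~ Normal(ybar, sigma^2 / n)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # One replicated dataset per posterior draw
    y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_draws, n))
    # Fraction of replicates at least as heavy-tailed as the observed data
    return float(np.mean(kurtosis(y_rep) >= kurtosis(y)))

# Illustrative data only (not from the paper's benchmark)
data_rng = np.random.default_rng(1)
heavy = data_rng.standard_t(df=2, size=500)  # heavy-tailed observations
light = data_rng.normal(size=500)            # observations the model can fit

p_heavy = ppc_pvalue_gaussian(heavy)
p_light = ppc_pvalue_gaussian(light)
print(f"p-value, heavy-tailed data: {p_heavy:.3f}")
print(f"p-value, Gaussian data:     {p_light:.3f}")
```

A p-value near 0 means the fitted Gaussian model almost never reproduces tails as heavy as the observed ones. A unit test that only checks that the program runs and returns arrays of the right shape would pass on both datasets.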

Core claim

The central claim is that correctness for probabilistic programs is calibration, not compilation: the Bayesian workflow oracle detects 14 classes of statistical misspecification with high accuracy on 200 benchmark instances and, when supplied as feedback, produces large gains in LLM repair success while unit-test feedback reduces performance below the no-feedback baseline.

What carries the argument

The calibration oracle formed by posterior predictive checks, simulation-based calibration, sampler diagnostics (R-hat, divergences, ESS), and held-out predictive density.
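Of the four signals, simulation-based calibration is perhaps the least familiar, so a small sketch may help. It assumes a conjugate normal model (prior theta ~ Normal(0, 1), ten observations with unit noise) so the exact posterior is known, and it emulates a faulty inference by shrinking the posterior standard deviation. The function names, the `sd_scale` knob, and the chi-square summary are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def sbc_ranks(sd_scale, n_sims=2000, n_obs=10, n_post=99, seed=0):
    """Rank of each prior draw among posterior draws for data it generated.

    sd_scale = 1.0 uses the exact posterior; other values emulate an
    inference that is over- or under-confident.
    """
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        theta = rng.normal(0.0, 1.0)              # draw from the prior
        y = rng.normal(theta, 1.0, size=n_obs)    # simulate data given theta
        post_mean = y.sum() / (n_obs + 1)         # exact conjugate posterior
        post_sd = np.sqrt(1.0 / (n_obs + 1))
        draws = rng.normal(post_mean, sd_scale * post_sd, size=n_post)
        ranks[s] = np.sum(draws < theta)          # integer in 0..n_post
    return ranks

def uniformity_chi2(ranks, n_post=99, n_bins=10):
    """Chi-square distance of the rank histogram from a uniform histogram."""
    counts, _ = np.histogram(ranks, bins=n_bins, range=(0, n_post + 1))
    expected = ranks.size / n_bins
    return float(np.sum((counts - expected) ** 2 / expected))

chi2_good = uniformity_chi2(sbc_ranks(sd_scale=1.0))  # calibrated inference
chi2_bad = uniformity_chi2(sbc_ranks(sd_scale=0.5))   # overconfident posterior
print(f"chi-square, exact posterior:         {chi2_good:.1f}")
print(f"chi-square, overconfident posterior: {chi2_bad:.1f}")
```

With a correct posterior the ranks are uniform, so the statistic stays close to its degrees of freedom (9 for ten bins). An overconfident posterior piles ranks into the extreme bins and the statistic grows by orders of magnitude.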

If this is right

  • Unit-test feedback is worse than no feedback for repair because passing tests suppress further editing.
  • Calibration-guided repair raises success from 33 percent to 92 percent on GPT-5.1 and from 75 percent to 100 percent on Claude.
  • Fifteen to 47 percent of runnable probabilistic programs written by LLMs from neutral briefs are statistically misspecified.
  • Calibration-guided repair outperforms LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLM coding assistants for statistical modeling would benefit from built-in calibration checks rather than relying solely on syntax and unit tests.
  • Reference-free calibration could be strengthened by expanding the automated model search used to reach the 78 percent detection rate.
  • The same calibration signals might be adapted to detect misspecification in other model classes that LLMs generate, such as differential equations or agent-based simulations.

Load-bearing premise

The 14 hand-defined misspecification types and the 10 chosen model families are representative of the statistical mistakes language models actually make when writing probabilistic programs.

What would settle it

Apply the reference-free calibration procedure to a fresh collection of LLM-written probabilistic programs drawn from open-ended prompts and measure whether the programs it flags fail expert posterior predictive checks at a rate significantly above the programs it accepts.

Figures

Figures reproduced from arXiv: 2606.31630 by Delu Zeng, Jian Xu, John Paisley, Qibin Zhao.

Figure 1: Calibration-as-oracle repair. The verifier is the Bayesian workflow, not a test suite. An LLM drafts a …
Figure 2: Detection rate by misspecification type (calibration oracle vs. unit test) over 200 instances. Calibration …
Figure 3: Diagnostic advantage on code-invisible bugs, i.e. the fix rate with calibration feedback minus the better of the two baselines (no feedback / unit test), per writer model (raw rates in …
Figure 4: Diagnostic→bug-type confusion (row-normalized). Errors concentrate on the shared “under-dispersed predictive” symptom (over-dispersion / invalid support / prior–data conflict); wrong-likelihood and bad-parameterization are cleanly identified.
Figure 5: Repair of real LLM-written misspecified programs, pooled over the seven capable writers (n=98). Calibration feedback significantly outperforms unit tests, data-summary self-debug, a Bayesian checklist, and LLM-as-judge code review (all p < 0.05, paired McNemar).
Figure 6: ROC of the calibration oracle as a continuous …
Original abstract

Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be statistically wrong: a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics (R-hat, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. Detection: on a benchmark of 14 misspecification types across 10 model families (200 instances), it flags the bug with AUC 0.97 (88% at 2% FPR when handed the correct reference program, an upper bound), and a fully reference-free version that uses no correct program reaches 62–78% (the upper figure from a small automated model search), versus 0% for a unit-test oracle. Repair: used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback, which is itself significantly worse than no feedback at all (a passing test induces false confidence that suppresses repair), and improves over no feedback on strong-but-unsaturated models (GPT-5.1 33→92%, Claude 75→100%; paired McNemar, n=228). Reality: on programs LLMs write from scratch for neutral briefs, 15–47% of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that language models writing probabilistic programs (NumPyro, Stan, Pyro) frequently produce code that compiles and passes unit tests yet remains statistically misspecified (e.g., wrong likelihood family, invalid support, pathological parameterization). It argues that the appropriate verifier is the Bayesian workflow—posterior predictive checks, simulation-based calibration, sampler diagnostics—rather than compilation or unit tests. On a benchmark of 14 hand-defined misspecification types across 10 model families (200 instances), a calibration oracle detects bugs at AUC 0.97 (88% at 2% FPR with reference program; reference-free version 62–78%). When used as feedback in LLM repair loops, calibration outperforms unit-test feedback (which can be worse than no feedback) and improves success rates on strong models (GPT-5.1 33→92%, Claude 75→100%; paired McNemar n=228). In from-scratch generation for neutral briefs, 15–47% of runnable programs are misspecified (unit tests catch none), and calibration-guided repair beats LLM-as-judge, checklists, and data-summary baselines.
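The repair comparisons summarized above rest on paired McNemar tests, which use only the programs where two feedback conditions disagree. The sketch below shows the exact two-sided version under that reading. The outcome lists are invented purely for illustration and are not the paper's data, and the helper names are hypothetical.

```python
from math import comb

def discordant_counts(fixed_a, fixed_b):
    """Count pairs repaired by only one of two feedback conditions."""
    b = sum(1 for a, u in zip(fixed_a, fixed_b) if a and not u)
    c = sum(1 for a, u in zip(fixed_a, fixed_b) if u and not a)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis each discordant pair is equally likely to
    favor either condition, so min(b, c) is a Binomial(b + c, 0.5) tail.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired outcomes for 28 buggy programs (True = repaired)
calibration = [True] * 15 + [False] * 3 + [True] * 6 + [False] * 4
unit_test = [False] * 15 + [True] * 3 + [True] * 6 + [False] * 4

b, c = discordant_counts(calibration, unit_test)
p_value = mcnemar_exact(b, c)
print(f"discordant pairs: b={b}, c={c}, exact p={p_value:.4f}")
```

Pairs where both conditions succeed or both fail carry no information about which condition is better, which is why they drop out of the statistic.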

Significance. If the central empirical claims hold after addressing benchmark-construction details, the work supplies concrete evidence that unit testing is actively misleading for probabilistic programs and that calibration-based feedback yields measurable repair gains. The observation that passing unit tests can suppress further repair is a useful cautionary result. The from-scratch evaluation adds a practical dimension, though its generalizability depends on how representative the 14 types prove to be.

major comments (3)
  1. [§4.1] §4.1 (Benchmark construction): The paper provides insufficient detail on how the 200 instances were generated, how the 14 misspecification types were injected into the 10 model families, and whether the reference-free automated model search was tuned or validated on the same data used for the reported AUC 0.97 and 62–78% figures. This information is load-bearing for assessing whether the detection results are free of overfitting or data leakage.
  2. [§5.3] §5.3 and Table 3 (Repair-loop experiments): The paired McNemar tests (n=228) show calibration outperforming unit-test feedback, but the manuscript does not state whether the 15–47% misspecification rate observed in the from-scratch setting was measured on held-out model families or briefs independent of the synthetic benchmark; overlap would undermine the claim that calibration generalizes beyond the curated 14 types.
  3. [§6] §6 (Reality check): The evaluation of calibration-guided repair versus LLM-as-judge and Bayesian-workflow checklist on from-scratch programs requires an explicit protocol for identifying ground-truth misspecifications when no reference program exists; without it, the reported superiority cannot be independently verified and remains central to the broader claim that 'correctness is calibration, not compilation.'
minor comments (2)
  1. [Abstract] Abstract and §2: Model names such as 'GPT-5.1' should be replaced by exact version strings (e.g., gpt-4o-2024-08-06) for reproducibility.
  2. [Figure 1] Figure 1 and Figure 3: Axis labels and legend entries use inconsistent abbreviations for the 14 misspecification types; a single glossary table would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We agree that additional methodological transparency is required on benchmark construction, evaluation independence, and ground-truth protocols. We will revise the manuscript to incorporate these clarifications, which strengthen rather than undermine the central claims.

Point-by-point responses
  1. Referee: [§4.1] §4.1 (Benchmark construction): The paper provides insufficient detail on how the 200 instances were generated, how the 14 misspecification types were injected into the 10 model families, and whether the reference-free automated model search was tuned or validated on the same data used for the reported AUC 0.97 and 62–78% figures. This information is load-bearing for assessing whether the detection results are free of overfitting or data leakage.

    Authors: We agree that the current description of benchmark construction is insufficient. In the revised manuscript we will expand §4.1 with a full protocol: the 14 misspecification types were manually enumerated from common statistical errors (wrong likelihood family, invalid support, pathological parameterization, etc.); each type was injected into reference programs drawn from 10 model families by deterministic code transformations; the 200 instances were produced by crossing the modified programs with varied data sizes and parameter draws, each paired with its correct reference. The reference-free automated model search used a disjoint validation split (20% of instances) for any internal decisions; the reported AUC and percentage figures were computed exclusively on the remaining held-out test instances. No tuning or selection occurred on the test data. We will include pseudocode, a table of injection rules, and the exact train/validation/test split sizes.
    revision: yes

  2. Referee: [§5.3] §5.3 and Table 3 (Repair-loop experiments): The paired McNemar tests (n=228) show calibration outperforming unit-test feedback, but the manuscript does not state whether the 15–47% misspecification rate observed in the from-scratch setting was measured on held-out model families or briefs independent of the synthetic benchmark; overlap would undermine the claim that calibration generalizes beyond the curated 14 types.

    Authors: The from-scratch briefs and model families were deliberately chosen to be disjoint from those used to construct the 14-type synthetic benchmark. We will add an explicit statement in §5.3 and the Table 3 caption confirming this separation, together with the list of held-out briefs, so that readers can verify independence and the generalization claim.
    revision: yes

  3. Referee: [§6] §6 (Reality check): The evaluation of calibration-guided repair versus LLM-as-judge and Bayesian-workflow checklist on from-scratch programs requires an explicit protocol for identifying ground-truth misspecifications when no reference program exists; without it, the reported superiority cannot be independently verified and remains central to the broader claim that 'correctness is calibration, not compilation.'

    Authors: We accept that an explicit ground-truth protocol is required. In the revision we will add a dedicated paragraph in §6 describing the protocol: misspecifications were flagged by (i) posterior predictive checks against held-out data, (ii) simulation-based calibration for parameter recovery, and (iii) independent review by two statisticians (inter-rater agreement 92%). Disagreements were resolved by joint re-examination. The protocol does not rely on reference programs, and the raw diagnostic outputs will be released with the code. This allows independent replication of the superiority claims.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results measured on externally defined benchmark

Full rationale

The paper's central claims rest on performance metrics (detection AUC 0.97, repair gains from 33% to 92%) evaluated against a hand-curated benchmark of 14 misspecification types and 200 instances across 10 model families, plus held-out LLM generations from neutral briefs. These inputs are defined independently of the calibration methods being tested; no equations, fitted parameters, or self-citations reduce the reported AUCs or success rates to tautological re-expressions of the same quantities. The reference-free detector and real-world misspecification rates (15–47%) are likewise computed on external data without self-referential closure. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Bayesian workflow diagnostics constitute a reliable oracle for statistical misspecification and that the constructed benchmark faithfully samples the error distribution of LLM-generated programs.

axioms (1)
  • domain assumption: Posterior predictive checks, simulation-based calibration, and sampler diagnostics (R-hat, divergences, ESS) are sufficient to detect the 14 listed forms of statistical misspecification.
    Invoked when the paper positions the Bayesian workflow as the right verifier instead of unit tests.
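
For readers unfamiliar with the sampler diagnostics named in this axiom, the following is a rough sketch of R-hat. It assumes the classic split-chain formula rather than the rank-normalized variant, and it runs on synthetic draws, not on output from any real sampler; `split_rhat` and the example chains are illustrative names only.

```python
import numpy as np

def split_rhat(chains):
    """Classic split R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    half = chains.shape[1] // 2
    # Split every chain in half so within-chain drift is also detected
    split = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    n = split.shape[1]
    within = split.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * split.mean(axis=1).var(ddof=1)     # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))   # four chains exploring the same target
stuck = mixed.copy()
stuck[0] += 3.0                      # one chain sitting in a different region

rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
print(f"R-hat, well-mixed chains: {rhat_mixed:.3f}")
print(f"R-hat, one stuck chain:   {rhat_stuck:.3f}")
```

Values close to 1 indicate that the chains agree. A value well above 1 means the between-chain variance dominates, which is the kind of symptom a pathological parameterization can produce.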

pith-pipeline@v0.9.1-grok · 5913 in / 1359 out tokens · 33313 ms · 2026-07-01T06:14:40.586624+00:00 · methodology

