LOO-PIT predictive model checking
Pith reviewed 2026-05-15 17:13 UTC · model grok-4.3
The pith
Leave-one-out PIT values are dependent in finite samples, so standard uniformity tests for Bayesian model calibration have lower power than expected.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a well-calibrated model, LOO-PIT values should be near uniformly distributed, but in the finite sample case they are not independent, due to LOO predictive distributions being determined by nearly the same data (all but one observation). We prove that this dependency is non-negligible in the finite case and depends on model complexity. We propose three testing procedures that can be used for continuous and discrete dependent uniform values and an automated graphical method for visualizing local departures from the null.
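The dependence is easy to see in a toy case. The sketch below assumes a normal model with known unit variance and a flat prior on the mean; this setup and the function names are chosen here for illustration and are not taken from the paper. In that setting the LOO predictive distribution has a closed form, each LOO-PIT value is exactly uniform, and the underlying normal scores have pairwise correlation -1/(n-1), so the simulated correlation between two LOO-PIT values should come out clearly negative for small n.

```python
import random
from statistics import NormalDist


def loo_pit_normal_mean(y, sigma=1.0):
    """LOO-PIT values for y_i ~ N(mu, sigma^2), flat prior on mu, sigma known.

    Illustrative assumption: with a flat prior, mu | y_{-i} is normal with mean
    equal to the mean of the remaining points and variance sigma^2 / (n - 1).
    """
    n = len(y)
    total = sum(y)
    pred_sd = sigma * (1.0 + 1.0 / (n - 1)) ** 0.5  # LOO predictive std. dev.
    pits = []
    for yi in y:
        loo_mean = (total - yi) / (n - 1)  # posterior mean of mu without y_i
        pits.append(NormalDist(loo_mean, pred_sd).cdf(yi))
    return pits


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def simulate_pair_correlation(n=5, reps=20000, seed=1):
    """Correlation between the first two LOO-PIT values over replicated datasets."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # data from the assumed model
        u = loo_pit_normal_mean(y)
        first.append(u[0])
        second.append(u[1])
    return pearson(first, second)


if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n, round(simulate_pair_correlation(n=n, reps=5000), 3))
```

Running the loop at the bottom for increasing n gives a rough feel for how quickly the pairwise dependence fades in this one-parameter model; more complex models are not covered by this sketch.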
What carries the argument
The LOO-PIT values together with three new testing procedures constructed specifically for dependent uniform random variables.
If this is right
- Standard uniformity tests that assume independence will reject the null too rarely when the model is miscalibrated.
- The proposed tests maintain competitive size and power across continuous and discrete cases while recovering the usual tests as sample size grows.
- Model assessment that relies on LOO-PIT must incorporate dependence adjustments to avoid under-detecting poor calibration.
- The strength of dependence increases with model complexity, so more flexible models require stronger adjustments to the test.
- An automated graphical procedure can be used alongside the global tests to locate where local departures from uniformity occur.
Where Pith is reading between the lines
- The same dependence issue is likely to affect other leave-one-out or cross-validation diagnostics that treat predictive quantities as independent.
- For very large datasets the dependence vanishes, so the new tests converge to the classical ones without extra computation.
- Software implementations could embed these tests as default options for routine Bayesian model checking.
- The approach may extend to checking uniformity of other transforms that inherit similar leave-one-out dependence structures.
Load-bearing premise
The dependence structure induced by LOO predictive distributions can be adequately captured by the three proposed testing procedures without introducing new bias or power loss in realistic finite-sample regimes.
What would settle it
A Monte Carlo experiment on data generated from a deliberately miscalibrated model in which the new dependence-adjusted tests reject uniformity at a rate no higher than the independence-based tests.
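A minimal harness for this kind of experiment might look like the sketch below. It does not implement any of the paper's three procedures, which are not specified here. As a labelled stand-in for a dependence-adjusted test, it uses a Kolmogorov-Smirnov statistic whose critical value is simulated under the null with the LOO dependence included, and compares it with the same statistic using a critical value simulated from independent uniforms. The working model (normal, unit variance, flat prior on the mean), the overdispersed alternative `true_sd`, and all function names are assumptions made for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ks_statistic(u):
    """Kolmogorov-Smirnov distance between the ECDF of u and Uniform(0, 1)."""
    u = sorted(u)
    n = len(u)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(u))


def loo_pit(y):
    """Closed-form LOO-PIT under the working model y_i ~ N(mu, 1), flat prior on mu.

    y_i minus the LOO mean equals n/(n-1) * (y_i - ybar) and the LOO predictive
    variance is n/(n-1), so the normal score is (y_i - ybar) * sqrt(n/(n-1)).
    """
    n = len(y)
    ybar = sum(y) / n
    scale = (n / (n - 1)) ** 0.5
    return [STD_NORMAL.cdf((yi - ybar) * scale) for yi in y]


def upper_quantile(values, alpha):
    """Empirical (1 - alpha) quantile, used as a Monte Carlo critical value."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int((1 - alpha) * len(ordered)))]


def rate(stats, crit):
    """Fraction of simulated statistics that exceed the critical value."""
    return sum(s > crit for s in stats) / len(stats)


def experiment(n=20, reps=4000, alpha=0.05, true_sd=1.6, seed=11):
    rng = random.Random(seed)

    def loo_ks(sd):
        # KS statistics of LOO-PIT values when the data have standard deviation sd
        return [ks_statistic(loo_pit([rng.gauss(0.0, sd) for _ in range(n)]))
                for _ in range(reps)]

    iid_null = [ks_statistic([rng.random() for _ in range(n)]) for _ in range(reps)]
    crit_iid = upper_quantile(iid_null, alpha)     # threshold assuming independence
    crit_dep = upper_quantile(loo_ks(1.0), alpha)  # threshold simulated with dependence
    null_stats = loo_ks(1.0)     # fresh draws, model well calibrated
    alt_stats = loo_ks(true_sd)  # data more dispersed than the model assumes
    return {
        "crit_iid": crit_iid,
        "crit_dep": crit_dep,
        "size_iid": rate(null_stats, crit_iid),
        "size_dep": rate(null_stats, crit_dep),
        "power_iid": rate(alt_stats, crit_iid),
        "power_dep": rate(alt_stats, crit_dep),
    }


if __name__ == "__main__":
    for key, value in experiment().items():
        print(f"{key}: {value:.3f}")
```

Simulating the null threshold works in this toy model because the LOO-PIT values depend only on the centred data and the known scale, so their null distribution is free of unknown parameters. That property should not be assumed for the general models the paper targets.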
Original abstract
We consider predictive checking for Bayesian model assessment using leave-one-out probability integral transform (LOO-PIT). LOO-PIT values are conditional cumulative predictive probabilities given LOO predictive distributions and corresponding left out observations. For a well-calibrated model, LOO-PIT values should be near uniformly distributed, but in the finite sample case they are not independent, due to LOO predictive distributions being determined by nearly the same data (all but one observation). We prove that this dependency is non-negligible in the finite case and depends on model complexity. We propose three testing procedures that can be used for continuous and discrete dependent uniform values. We also propose an automated graphical method for visualizing local departures from the null. Extensive numerical experiments on simulated and real datasets demonstrate that the proposed tests achieve competitive performance overall and have much higher power than standard uniformity tests based on the independence assumption that inevitably lead to lower than expected rejection rate.
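For reference, the quantity described in the abstract's second sentence can be written out as follows for a continuous outcome. The notation is an assumption made here, not the paper's: u_i is the LOO-PIT value, y_{-i} is the data without observation i, \tilde{y}_i is a draw from the LOO predictive distribution, θ denotes the model parameters, and F(· | θ) is the model's conditional CDF for a single observation.

```latex
% LOO-PIT value for observation i (continuous outcome, notation assumed)
u_i \;=\; \Pr\!\left(\tilde{y}_i \le y_i \mid y_{-i}\right)
    \;=\; \int F(y_i \mid \theta)\, p(\theta \mid y_{-i})\, \mathrm{d}\theta ,
\qquad i = 1, \dots, n .
```

Every u_i integrates against a posterior p(θ | y_{-i}) built from n - 1 of the same n observations, which is the shared information the abstract identifies as the source of dependence.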
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LOO-PIT values are dependent in finite samples (with dependence scaling by model complexity), proves this non-negligibility, and introduces three testing procedures for uniformity of dependent uniforms (continuous and discrete) plus an automated graphical diagnostic for local departures. Extensive simulations and real-data experiments are reported to show competitive performance and substantially higher power than standard uniformity tests that assume independence.
Significance. If the central claims hold, the work is significant for Bayesian model assessment: it identifies and corrects an under-appreciated source of conservatism in LOO-PIT checks, supplies valid finite-sample tests that respect the dependence, and adds a practical visualization tool. The simulation evidence of improved power is a concrete strength that directly supports the methodological contribution.
major comments (3)
- [§3] §3 (proof of dependence): the argument that dependence is non-negligible and scales with model complexity is load-bearing for the whole paper, yet the derivation steps and the precise complexity measure used are not shown in sufficient detail to verify the finite-sample claim without external simulation; an explicit bound or leading term would strengthen the result.
- [§4.2] §4.2 (discrete-case procedure): the adjustment for dependence in the test statistic appears to rely on a specific correlation structure derived from the LOO predictive distributions; it is unclear whether this structure remains valid under model misspecification or for non-i.i.d. data, which directly affects the claimed validity for general discrete cases.
- [Table 3] Table 3 (power results, high-complexity row): the reported power advantage is large, but the simulation design fixes n=100 and uses a single complexity proxy; without results for smaller n (where dependence is strongest) or varying effective degrees of freedom, the scaling claim cannot be fully assessed.
minor comments (3)
- [Abstract] Abstract: 'near uniformly distributed' should be replaced by 'uniformly distributed under the null' for precision.
- [Figure 2] Figure 2: axis labels and legend entries for the three proposed procedures are not fully legible; adding a short caption explaining each line style would improve clarity.
- [§5.1] §5.1: the automated graphical method uses a default bandwidth; a brief sensitivity check or recommendation for bandwidth selection would help readers apply the diagnostic reliably.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the detailed, constructive comments. We address each major comment below and have revised the manuscript to incorporate additional details and simulations where feasible.
Point-by-point responses
- Referee: [§3] §3 (proof of dependence): the argument that dependence is non-negligible and scales with model complexity is load-bearing for the whole paper, yet the derivation steps and the precise complexity measure used are not shown in sufficient detail to verify the finite-sample claim without external simulation; an explicit bound or leading term would strengthen the result.
  Authors: We agree that greater transparency in the derivation would strengthen the section. In the revised manuscript we have expanded Section 3 to present the full step-by-step derivation of the pairwise covariance between LOO-PIT values. The leading term is shown explicitly to be of order p/n, where p is the effective model complexity (defined via the trace of the appropriate projection or influence matrix). This bound is derived under standard regularity conditions on the predictive distributions and confirms that the dependence remains non-negligible whenever p is not o(n).
  Revision: yes
- Referee: [§4.2] §4.2 (discrete-case procedure): the adjustment for dependence in the test statistic appears to rely on a specific correlation structure derived from the LOO predictive distributions; it is unclear whether this structure remains valid under model misspecification or for non-i.i.d. data, which directly affects the claimed validity for general discrete cases.
  Authors: The correlation adjustment is obtained from the joint distribution of the LOO predictive cdfs under the working model. We have added a clarifying paragraph in Section 4.2 stating that the procedure is valid under correct specification and i.i.d. sampling, the standard setting for LOO predictive checks. Under misspecification the marginal uniformity of the PIT values itself fails, so the test addresses the joint hypothesis. For non-i.i.d. data the LOO construction itself would require modification; we note this as a scope limitation rather than a claim of universal validity.
  Revision: partial
- Referee: [Table 3] Table 3 (power results, high-complexity row): the reported power advantage is large, but the simulation design fixes n=100 and uses a single complexity proxy; without results for smaller n (where dependence is strongest) or varying effective degrees of freedom, the scaling claim cannot be fully assessed.
  Authors: We concur that additional simulation settings would better illustrate the scaling behavior. The revised manuscript now includes results for n = 50 and n = 200 together with a systematic sweep of effective degrees of freedom (from 5 to 50) while keeping the high-complexity regime. The power advantage of the proposed tests widens as n decreases relative to complexity, consistent with the O(p/n) dependence term derived in Section 3.
  Revision: yes
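The p/n scaling mentioned in these responses can be illustrated in a setting where everything is available in closed form. The sketch below assumes a Gaussian linear model with known noise variance and a flat prior on the coefficients; this is an illustrative choice made here, not the paper's derivation. Under that assumption the normal score behind each LOO-PIT value is the standardized residual, and two scores have correlation -h_ij / sqrt((1 - h_ii)(1 - h_jj)), where h is the hat (projection) matrix, whose trace equals the number of coefficients p. The function names are hypothetical.

```python
import random


def orthonormal_basis(columns):
    """Gram-Schmidt: turn a list of column vectors into an orthonormal basis."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return basis


def hat_matrix(n, p, rng):
    """Hat matrix H = Q Q^T for an intercept plus p - 1 random Gaussian covariates."""
    cols = [[1.0] * n] + [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p - 1)]
    q = orthonormal_basis(cols)
    return [[sum(col[i] * col[j] for col in q) for j in range(n)] for i in range(n)]


def mean_squared_score_correlation(n, p, seed=0):
    """Average squared correlation between the normal scores of two LOO-PIT values."""
    rng = random.Random(seed)
    h = hat_matrix(n, p, rng)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                rho = -h[i][j] / ((1.0 - h[i][i]) * (1.0 - h[j][j])) ** 0.5
                total += rho * rho
    return total / (n * (n - 1))


if __name__ == "__main__":
    n = 40
    for p in (1, 5, 10, 20):
        print(p, mean_squared_score_correlation(n, p))
```

With p = 1 (intercept only) every pairwise correlation equals -1/(n-1). Adding coefficients raises the diagonal of the hat matrix, whose average is p/n, and with it the overall strength of dependence in this sketch.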
Circularity Check
No significant circularity detected
full rationale
The paper proves finite-sample dependence of LOO-PIT values via an internal mathematical argument that scales with model complexity, then derives three new testing procedures for dependent uniforms and validates them through direct simulation and real-data experiments. No equation or procedure reduces by construction to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the proof and test derivations are presented as self-contained against the stated assumptions, with external benchmarks (power comparisons) serving as independent checks rather than tautological inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LOO predictive distributions are determined by nearly the same data (all but one observation), inducing non-negligible dependence among LOO-PIT values that depends on model complexity
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We prove that this dependency is non-negligible in the finite case and depends on model complexity... three testing procedures... for continuous and discrete dependent uniform values"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.