Besag-Clifford e-values for unnormalized testing
Pith reviewed 2026-05-15 09:49 UTC · model grok-4.3
The pith
Besag-Clifford e-values allow valid testing of unnormalized models by generating exchangeable MCMC samples under the null.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
As the number of samples grows, Besag-Clifford e-values constructed using the unnormalized likelihood ratio are log-optimal up to a multiplicative term that diminishes with the mixing time of the Markov chain. Averaging over the output of multiple chains retains validity while increasing the e-power. The method extends to the general problem of unnormalized test statistics for composite hypotheses, uncertainty quantification, generative model evaluation, and sequential testing.
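For orientation, the validity and optimality statements can be sketched in symbols. The form of the e-value below is an assumption made for illustration, since this summary does not reproduce the paper's definition: T is a non-negative test statistic, X_0 is the observed data, X_1, ..., X_N are the MCMC samples, and \tilde p_j = Z_j p_j are the unnormalized densities with unknown constants Z_j.

```latex
% Assumed form of the e-value (illustrative, not quoted from the paper):
E_N \;=\; \frac{T(X_0)}{\frac{1}{N+1}\sum_{i=0}^{N} T(X_i)},
\qquad T \ge 0, \quad \sum_{i=0}^{N} T(X_i) > 0 .

% Validity: if (X_0, \dots, X_N) is exchangeable under H_0, every term has the
% same expected share of the total, and the shares sum to one, so
\mathbb{E}_{H_0}\!\left[\frac{T(X_j)}{\sum_{i=0}^{N} T(X_i)}\right] = \frac{1}{N+1}
\quad \text{for all } j
\qquad\Longrightarrow\qquad
\mathbb{E}_{H_0}[E_N] = 1 .

% Heuristic limit with T = \tilde p_1 / \tilde p_0, when the samples behave
% like draws from p_0:
\frac{1}{N+1}\sum_{i=0}^{N} \frac{\tilde p_1(X_i)}{\tilde p_0(X_i)}
\;\longrightarrow\;
\mathbb{E}_{p_0}\!\left[\frac{\tilde p_1}{\tilde p_0}\right] = \frac{Z_1}{Z_0},
\qquad\text{so}\qquad
E_N \;\longrightarrow\; \frac{p_1(X_0)}{p_0(X_0)} .
```

The first implication uses only exchangeability, which is why no normalizing constant is needed for validity. The limit is the normalized likelihood ratio, the log-optimal e-value for a simple null against a simple alternative; how fast a real chain approaches it is what the mixing-time term in the claim governs.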
What carries the argument
Parallel MCMC chains initialized from the data to produce exchangeable samples under the null, combined with the unnormalized likelihood ratio to form e-values.
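The construction can be sketched in a toy setting. Everything below is an illustrative assumption rather than the paper's code: the one-dimensional Gaussian null and alternative, the random-walk Metropolis kernel, the arbitrary constant standing in for an unknown normalizer, the ratio form of the e-value, and all function names. The sketch runs the chain from the data to a "hub" state, then runs independent chains forward from the hub; because the kernel is reversible with respect to the null, the data and the chain outputs are exchangeable under the null.

```python
import numpy as np

def log_q0(x):
    # Unnormalized null log-density (standard normal up to a constant).
    return -0.5 * x**2

def log_q1(x, mu=1.0):
    # Unnormalized alternative log-density; "+ 3.0" stands in for an
    # unknown normalizing constant and cancels in the e-value.
    return -0.5 * (x - mu)**2 + 3.0

def rw_metropolis(x, n_steps, rng, step=1.0):
    # Random-walk Metropolis targeting the null; reversible w.r.t. q0.
    for _ in range(n_steps):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_q0(prop) - log_q0(x):
            x = prop
    return x

def besag_clifford_samples(x_obs, n_chains, n_steps, rng):
    # Step 1: move from the data to a hub state. For a reversible kernel,
    # running "backward" is the same as running forward.
    hub = rw_metropolis(x_obs, n_steps, rng)
    # Step 2: run independent chains forward from the hub.
    return np.array([rw_metropolis(hub, n_steps, rng) for _ in range(n_chains)])

def bc_e_value(x_obs, samples):
    # Assumed form: statistic at the data divided by its average over
    # the data and the exchangeable samples.
    xs = np.concatenate(([x_obs], samples))
    log_t = log_q1(xs) - log_q0(xs)   # unnormalized log likelihood ratio
    log_t -= log_t.max()              # stabilize before exponentiating
    t = np.exp(log_t)
    return t[0] / t.mean()

rng = np.random.default_rng(0)
x_obs = 2.5  # hypothetical observation
samples = besag_clifford_samples(x_obs, n_chains=99, n_steps=20, rng=rng)
print(bc_e_value(x_obs, samples))
```

In this form the e-value can never exceed the number of samples plus one, so the number of chains caps the evidence a single run can report.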
Load-bearing premise
The parallel MCMC chains must generate samples that are exchangeable with the observed data under the null hypothesis, requiring proper initialization and sufficient mixing.
What would settle it
The validity claim would be falsified if, in repeated simulations under the null with controllable mixing times, the constructed e-values exceeded the nominal threshold more often than the allowed error rate.
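A check of this kind can be sketched as a small Monte Carlo experiment. The setup is hypothetical and not taken from the paper: a standard normal null, a mean-shift alternative, a random-walk Metropolis kernel, an e-value of the ratio form, and the number of steps per chain (`n_steps`) as the knob that controls mixing. By Markov's inequality, a valid e-value exceeds 1/alpha with probability at most alpha under the null.

```python
import numpy as np

def log_q0(x):
    # Unnormalized null log-density (standard normal up to a constant).
    return -0.5 * x**2

def log_ratio(x, mu=1.0):
    # log q1(x) - log q0(x) for a hypothetical N(mu, 1) alternative,
    # with additive constants dropped (they cancel in the e-value).
    return mu * x

def metropolis(x, n_steps, rng, step=1.0):
    # Random-walk Metropolis kernel, reversible w.r.t. the null.
    for _ in range(n_steps):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_q0(prop) - log_q0(x):
            x = prop
    return x

def e_value(x_obs, n_chains, n_steps, rng):
    # Data -> hub -> independent forward chains, then the ratio statistic.
    hub = metropolis(x_obs, n_steps, rng)
    xs = np.array([x_obs] + [metropolis(hub, n_steps, rng) for _ in range(n_chains)])
    log_t = log_ratio(xs)
    t = np.exp(log_t - log_t.max())
    return t[0] / t.mean()

def null_rejection_rate(alpha, n_steps, n_reps=500, n_chains=19, seed=0):
    # Draw data from the null, compute e-values, and count how often
    # they reach the threshold 1 / alpha.
    rng = np.random.default_rng(seed)
    e = np.array([e_value(rng.normal(), n_chains, n_steps, rng) for _ in range(n_reps)])
    return float(np.mean(e >= 1.0 / alpha))

if __name__ == "__main__":
    for n_steps in (1, 4, 16):
        print(n_steps, null_rejection_rate(alpha=0.1, n_steps=n_steps))
```

In this toy setup the kernel is reversible and the data are drawn exactly from the null, so the rejection rate should stay at or below alpha (up to Monte Carlo error) for every value of `n_steps`. A rate clearly above alpha would indicate either a bug or a broken exchangeability assumption.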
original abstract
Unnormalized probability distributions are frequently used in machine learning for modeling complex data generating processes. Though Markov chain Monte Carlo (MCMC) algorithms can approximately sample from unnormalized distributions, intractability of their normalizing constants renders likelihood ratio testing infeasible. We propose to use the parallel method of Besag and Clifford to generate samples that are exchangeable with the data under the null, to then generate valid e-values for any number of iterations or algorithmic steps. We show that as the number of samples grows, these Besag-Clifford e-values constructed using the unnormalized likelihood ratio are actually log-optimal up to a multiplicative term that diminishes with the mixing time of the Markov chain. Additionally, averaging over the output of multiple chains retains validity while increasing the e-power. We extend Besag-Clifford e-values to the general problem of unnormalized test statistics, which allows application to composite hypotheses, uncertainty quantification, generative model evaluation, and sequential testing. Through simulations and an application to galaxy velocity modeling, we empirically verify our theory, explore the impact of autocorrelation and mixing, and evaluate the performance of Besag-Clifford e-values.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Besag-Clifford e-values for hypothesis testing with unnormalized distributions by using parallel MCMC chains to produce samples exchangeable with the observed data under the null. This yields valid e-values for any number of iterations. The central theoretical result is that these e-values are asymptotically log-optimal up to a multiplicative factor that vanishes with the Markov chain mixing time. The approach is extended to unnormalized test statistics for composite hypotheses, uncertainty quantification, and sequential testing. Validity is retained when averaging multiple chains, and the method is illustrated via simulations and an application to galaxy velocity modeling.
Significance. If the asymptotic claim holds with practically useful rates, the work would provide a principled, parameter-free route to valid e-value testing in the common setting of intractable normalizing constants. It usefully connects the Besag-Clifford exchangeability construction to modern e-value theory and MCMC practice. The extension to composite hypotheses and the empirical galaxy-data example are concrete strengths. The manuscript does not mention reproducible code or machine-checked proofs.
major comments (2)
- [§3] §3 (asymptotic log-optimality theorem): the claim that the multiplicative correction term vanishes with mixing time is central to both the optimality and practical validity statements, yet the manuscript supplies no explicit quantitative bound relating the term's size to chain length, dimension, initialization distribution, or sample size. Without such a rate, it is impossible to determine whether the correction becomes negligible before the asymptotic regime is reached in realistic MCMC settings.
- [§5] §5 (simulation studies): the reported experiments contain no error bars, confidence intervals, or sensitivity checks to starting distribution and autocorrelation length. Because the exchangeability argument (and therefore finite-sample validity) rests on sufficient mixing, the absence of these diagnostics leaves the empirical support for the theory incomplete.
minor comments (1)
- [Introduction] The dependence of the e-value on the precise number of algorithmic steps within each chain is mentioned in the abstract but not quantified in the main text; a short remark on this point would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper accordingly.
point-by-point responses
- Referee: [§3] §3 (asymptotic log-optimality theorem): the claim that the multiplicative correction term vanishes with mixing time is central to both the optimality and practical validity statements, yet the manuscript supplies no explicit quantitative bound relating the term's size to chain length, dimension, initialization distribution, or sample size. Without such a rate, it is impossible to determine whether the correction becomes negligible before the asymptotic regime is reached in realistic MCMC settings.
  Authors: We acknowledge that an explicit rate would be desirable for practical guidance. However, since MCMC mixing times are highly problem-dependent and our result is stated in terms of the general mixing time, providing a universal quantitative bound is not feasible without further assumptions on the chain. In the revised manuscript, we have expanded the discussion in §3 to include practical guidance on assessing mixing via standard diagnostics and noted that the term vanishes in the limit of perfect mixing.
  Revision: partial
- Referee: [§5] §5 (simulation studies): the reported experiments contain no error bars, confidence intervals, or sensitivity checks to starting distribution and autocorrelation length. Because the exchangeability argument (and therefore finite-sample validity) rests on sufficient mixing, the absence of these diagnostics leaves the empirical support for the theory incomplete.
  Authors: We agree with this observation. The revised manuscript now includes error bars computed from repeated simulations, confidence intervals for key performance metrics, and additional experiments varying the initialization and chain lengths to illustrate the impact of mixing on the results.
  Revision: yes
Circularity Check
No significant circularity in derivation of Besag-Clifford e-values
full rationale
The paper constructs Besag-Clifford e-values from parallel MCMC chains to achieve exchangeability under the null, then derives asymptotic log-optimality of the unnormalized likelihood ratio version up to a mixing-time factor. This chain relies on established prior results for e-values and MCMC exchangeability rather than reducing any central claim to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equation or step equates the optimality result to its inputs by construction; the mixing correction is stated as a theoretical vanishing term, not a fitted quantity. The derivation remains self-contained against external benchmarks in e-value theory and MCMC literature.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Parallel MCMC chains initialized from the observed data produce samples that are exchangeable with the data under the null
- domain assumption: The Markov chains have finite mixing time
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We propose to use the parallel method of Besag and Clifford to generate samples that are exchangeable with the data under the null, to then generate valid e-values for any number of iterations or algorithmic steps."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.