Besag-Clifford e-values for unnormalized testing

Aaditya Ramdas; Alexander Dombowsky; Barbara E. Engelhardt

arxiv: 2603.15845 · v2 · submitted 2026-03-16 · 📊 stat.ME


Pith reviewed 2026-05-15 09:49 UTC · model grok-4.3


keywords: e-values · unnormalized models · MCMC · hypothesis testing · exchangeability · likelihood ratio · Markov chains


The pith

Besag-Clifford e-values allow valid testing of unnormalized models by generating exchangeable MCMC samples under the null.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a testing procedure for probability models where the normalizing constant cannot be computed, a situation common in complex machine learning models. The approach runs multiple Markov chain Monte Carlo chains in parallel, starting from the observed data, to produce samples that match the distribution of the data under the null hypothesis. These samples enable the calculation of e-values using only the unnormalized likelihood ratio. The resulting e-values become log-optimal with growing sample size, up to a factor that improves with faster chain mixing. The technique further supports composite hypotheses, model evaluation, and sequential testing by generalizing to unnormalized test statistics.

Core claim

As the number of samples grows, Besag-Clifford e-values constructed using the unnormalized likelihood ratio are log-optimal up to a multiplicative term that diminishes with the mixing time of the Markov chain. Averaging over the output of multiple chains retains validity while increasing the e-power. The method extends to the general problem of unnormalized test statistics for composite hypotheses, uncertainty quantification, generative model evaluation, and sequential testing.
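The cancellation of the normalizing constants can be made concrete with a short sketch. The notation here ($\tilde p_0$, $\tilde p_1$, $Z_0$, $Z_1$, $T$, $B$, $E_B$) is assumed for illustration and may differ from the paper's; the form shown is one natural way to turn exchangeable draws into an e-value, not necessarily the authors' exact definition.

```latex
% Assumed notation: p_i = \tilde p_i / Z_i with Z_i unknown;
% X_0 is the observed data, X_1, ..., X_B are the MCMC draws.
T(x) = \frac{\tilde p_1(x)}{\tilde p_0(x)}, \qquad
E_B = \frac{T(X_0)}{\frac{1}{B+1}\sum_{j=0}^{B} T(X_j)} .

% Validity: the B+1 normalized ratios sum to B+1 and, under exchangeability,
% all share the same expectation, so each has expectation 1.
\sum_{i=0}^{B} \frac{T(X_i)}{\frac{1}{B+1}\sum_{j=0}^{B} T(X_j)} = B+1
\;\Longrightarrow\; \mathbb{E}_{H_0}[E_B] = 1 .

% Limit: if the sample mean converges to its expectation under the null,
\frac{1}{B+1}\sum_{j=0}^{B} T(X_j) \;\to\; \mathbb{E}_{p_0}[T(X)] = \frac{Z_1}{Z_0}
\;\Longrightarrow\; E_B \;\to\; \frac{p_1(X_0)}{p_0(X_0)} .
```

Any constant multiplying $T$ cancels between numerator and denominator, which is why only unnormalized densities are needed. The last line is where mixing enters: the speed at which the sample mean approaches $Z_1/Z_0$ depends on how correlated the draws are.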

What carries the argument

Parallel MCMC chains initialized from the data to produce exchangeable samples under the null, combined with the unnormalized likelihood ratio to form e-values.
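A minimal Python sketch of this mechanism follows, under loudly stated assumptions: a one-dimensional toy problem, a random-walk Metropolis kernel (reversible, so the same kernel serves for the backward and forward passes), and a mean-normalized e-value. The function names, the toy densities, and the e-value formula are illustrative choices and are not taken from the paper.

```python
import numpy as np

def log_q0(x):
    # Assumed unnormalized null log-density: N(0, 1) without its constant.
    return -0.5 * x**2

def log_q1(x):
    # Assumed unnormalized alternative log-density: N(1, 1) times an arbitrary constant.
    return -0.5 * (x - 1.0)**2 + 3.0

def rw_metropolis(x, n_steps, rng, step=1.0):
    # Random-walk Metropolis targeting q0; symmetric proposals make the chain reversible.
    for _ in range(n_steps):
        prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_q0(prop) - log_q0(x):
            x = prop
    return x

def besag_clifford_samples(x_obs, n_samples, n_steps, rng):
    # Backward pass from the data to a hub, then independent forward chains from the hub.
    hub = rw_metropolis(x_obs, n_steps, rng)
    return np.array([rw_metropolis(hub, n_steps, rng) for _ in range(n_samples)])

def bc_e_value(x_obs, samples):
    # Statistic of the data divided by the mean statistic over data plus samples.
    pool = np.concatenate(([x_obs], samples))
    log_t = log_q1(pool) - log_q0(pool)   # unnormalized log likelihood ratio
    t = np.exp(log_t - log_t.max())       # rescaling cancels in the ratio below
    return t[0] / t.mean()

rng = np.random.default_rng(0)
x_obs = 0.3
samples = besag_clifford_samples(x_obs, n_samples=20, n_steps=10, rng=rng)
print(bc_e_value(x_obs, samples))
```

Under the null the observed point is itself a draw from the stationary distribution, so the hub and the forward chains make the data and the samples exchangeable for any `n_steps`. The arbitrary constant `3.0` in `log_q1` has no effect on the result, which mirrors the fact that the normalizing constants are never needed.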

Load-bearing premise

The parallel MCMC chains must generate samples that are exchangeable with the observed data under the null hypothesis, requiring proper initialization and sufficient mixing.

What would settle it

Observing that the constructed e-values exceed the nominal threshold more often than the allowed error rate in repeated simulations under the null with controllable mixing times would falsify the validity claim.
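Such a check could be sketched as follows. This is a toy experiment under assumed settings (standard normal null, statistic proportional to exp(x), vectorized random-walk Metropolis chains, mean-normalized e-value); none of the names or parameter values come from the paper. It draws data from the null, builds an e-value each time, and reports the average e-value and the rate at which it crosses 1/alpha.

```python
import numpy as np

def log_q0(x):
    return -0.5 * x**2   # assumed null: N(0, 1) up to a constant

def log_t(x):
    return x             # assumed log statistic: unnormalized LR of N(1, 1) vs N(0, 1)

def metropolis(x, n_steps, rng, step=1.0):
    # Each entry of x is an independent random-walk Metropolis chain targeting q0.
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=x.shape)
        accept = np.log(rng.uniform(size=x.shape)) < log_q0(prop) - log_q0(x)
        x = np.where(accept, prop, x)
    return x

def bc_e_value(x_obs, n_samples, n_steps, rng):
    hub = metropolis([x_obs], n_steps, rng)[0]                   # backward pass
    samples = metropolis(np.full(n_samples, hub), n_steps, rng)  # forward chains
    pool = np.concatenate(([x_obs], samples))
    lt = log_t(pool)
    t = np.exp(lt - lt.max())
    return t[0] / t.mean()

def null_check(n_reps=400, n_samples=9, n_steps=5, alpha=0.2, seed=0):
    # The e-value is at most n_samples + 1, so 1/alpha must be below that
    # bound for a rejection to be possible at all.
    rng = np.random.default_rng(seed)
    e = np.array([bc_e_value(rng.normal(), n_samples, n_steps, rng)
                  for _ in range(n_reps)])
    reject = np.mean(e >= 1.0 / alpha)
    se = np.sqrt(reject * (1.0 - reject) / n_reps)  # binomial standard error
    return e.mean(), reject, se

mean_e, reject_rate, se = null_check()
print(f"mean e-value {mean_e:.3f}, rejection rate {reject_rate:.3f} +/- {se:.3f}")
```

A rejection rate clearly above `alpha`, beyond what the standard error allows, would be evidence against validity in this toy setting. Varying `n_steps` changes how well the chains mix, which is the knob the falsification test above asks for.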

Original abstract

Unnormalized probability distributions are frequently used in machine learning for modeling complex data generating processes. Though Markov chain Monte Carlo (MCMC) algorithms can approximately sample from unnormalized distributions, intractability of their normalizing constants renders likelihood ratio testing infeasible. We propose to use the parallel method of Besag and Clifford to generate samples that are exchangeable with the data under the null, to then generate valid e-values for any number of iterations or algorithmic steps. We show that as the number of samples grows, these Besag-Clifford e-values constructed using the unnormalized likelihood ratio are actually log-optimal up to a multiplicative term that diminishes with the mixing time of the Markov chain. Additionally, averaging over the output of multiple chains retains validity while increasing the e-power. We extend Besag-Clifford e-values to the general problem of unnormalized test statistics, which allows application to composite hypotheses, uncertainty quantification, generative model evaluation, and sequential testing. Through simulations and an application to galaxy velocity modeling, we empirically verify our theory, explore the impact of autocorrelation and mixing, and evaluate the performance of Besag-Clifford e-values.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adapts Besag-Clifford parallel sampling to build valid e-values from unnormalized likelihood ratios, with a log-optimality result that holds asymptotically once mixing is good enough.


The main advance is a concrete way to get e-values for testing unnormalized models by running parallel MCMC chains in the Besag-Clifford style so the samples are exchangeable with the data under the null. They plug the unnormalized likelihood ratio into the e-value construction and show it becomes log-optimal as the number of samples grows, up to a multiplicative correction that shrinks with better mixing. They also extend the idea to general unnormalized test statistics, which covers composite hypotheses, model evaluation, and sequential testing, and note that averaging chains keeps validity while raising power. The galaxy velocity application and simulations on autocorrelation give some empirical grounding.

The soft spot is the mixing requirement. Exchangeability only holds exactly when chains start in stationarity and have mixed; the paper treats any leftover bias as a vanishing term, but without explicit rates tying the size of that term to chain length, dimension, or starting distribution, it is hard to know when the near-optimality kicks in for typical ML models. If mixing is slow, finite-sample validity and power could suffer more than the theory suggests. The simulations explore autocorrelation, but the strength of those checks would need a closer look.

This is useful for statisticians and ML researchers who need valid inference with intractable normalizing constants. A reader already working with e-values or MCMC diagnostics would pick up the specific construction and the optimality claim quickly. It deserves a serious referee because the core idea is new, the problem is real, and the claims are checkable.
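How chain length affects power can be probed with a small experiment. The sketch below is a toy setup under assumed choices (N(0, 1) null, N(1, 1) alternative, random-walk Metropolis, mean-normalized e-value, hypothetical function names); it estimates the e-power, the expected log e-value under the alternative, for several chain lengths. It is not the paper's simulation design.

```python
import numpy as np

def log_q0(x):
    return -0.5 * x**2   # assumed null: N(0, 1) up to a constant

def log_t(x):
    return x             # assumed log statistic: unnormalized LR of N(1, 1) vs N(0, 1)

def metropolis(x, n_steps, rng, step=1.0):
    # Independent random-walk Metropolis chains, one per entry of x.
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=x.shape)
        accept = np.log(rng.uniform(size=x.shape)) < log_q0(prop) - log_q0(x)
        x = np.where(accept, prop, x)
    return x

def bc_e_value(x_obs, n_samples, n_steps, rng):
    hub = metropolis([x_obs], n_steps, rng)[0]
    samples = metropolis(np.full(n_samples, hub), n_steps, rng)
    pool = np.concatenate(([x_obs], samples))
    lt = log_t(pool)
    t = np.exp(lt - lt.max())
    return t[0] / t.mean()

def e_power(n_steps, n_reps=200, n_samples=20, seed=0):
    # Data are drawn from the assumed alternative N(1, 1).
    rng = np.random.default_rng(seed)
    logs = [np.log(bc_e_value(rng.normal(1.0, 1.0), n_samples, n_steps, rng))
            for _ in range(n_reps)]
    return float(np.mean(logs))

# For this toy pair the log-optimal e-power is KL(N(1,1) || N(0,1)) = 0.5.
for n_steps in (0, 1, 10):
    print(n_steps, round(e_power(n_steps), 3))
```

With `n_steps=0` every sample equals the data, so the e-value is exactly 1 and carries no evidence. Longer chains decorrelate the samples from the data and let the estimate move toward the reference value of 0.5, which is the qualitative behavior the mixing-dependent correction describes.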

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Besag-Clifford e-values for hypothesis testing with unnormalized distributions by using parallel MCMC chains to produce samples exchangeable with the observed data under the null. This yields valid e-values for any number of iterations. The central theoretical result is that these e-values are asymptotically log-optimal up to a multiplicative factor that vanishes with the Markov chain mixing time. The approach is extended to unnormalized test statistics for composite hypotheses, uncertainty quantification, and sequential testing. Validity is retained when averaging multiple chains, and the method is illustrated via simulations and an application to galaxy velocity modeling.
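The averaging statement can be checked with a short derivation. The notation ($K$, $E^{(k)}$, $\bar E$) is assumed here for illustration: $K$ e-values, each built from its own set of chains, each valid under the null and identically distributed under the alternative. The paper's own argument may be organized differently.

```latex
% Validity: linearity of expectation, no independence between chains needed.
\bar E = \frac{1}{K}\sum_{k=1}^{K} E^{(k)}, \qquad
\mathbb{E}_{H_0}[\bar E] = \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{H_0}\big[E^{(k)}\big] \le 1 .

% E-power: Jensen's inequality for the concave logarithm.
\mathbb{E}_{H_1}[\log \bar E]
\;\ge\; \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{H_1}\big[\log E^{(k)}\big]
= \mathbb{E}_{H_1}\big[\log E^{(1)}\big] .
```

The second line shows only that averaging cannot lower the e-power; how large the gain is depends on how much the individual e-values vary across chains.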

Significance. If the asymptotic claim holds with practically useful rates, the work would provide a principled, parameter-free route to valid e-value testing in the common setting of intractable normalizing constants. It usefully connects the Besag-Clifford exchangeability construction to modern e-value theory and MCMC practice. The extension to composite hypotheses and the empirical galaxy-data example are concrete strengths; reproducible code or machine-checked proofs are not mentioned.

major comments (2)

[§3] §3 (asymptotic log-optimality theorem): the claim that the multiplicative correction term vanishes with mixing time is central to both the optimality and practical validity statements, yet the manuscript supplies no explicit quantitative bound relating the term's size to chain length, dimension, initialization distribution, or sample size. Without such a rate, it is impossible to determine whether the correction becomes negligible before the asymptotic regime is reached in realistic MCMC settings.
[§5] §5 (simulation studies): the reported experiments contain no error bars, confidence intervals, or sensitivity checks to starting distribution and autocorrelation length. Because the exchangeability argument (and therefore finite-sample validity) rests on sufficient mixing, the absence of these diagnostics leaves the empirical support for the theory incomplete.

minor comments (1)

[Introduction] The dependence of the e-value on the precise number of algorithmic steps within each chain is mentioned in the abstract but not quantified in the main text; a short remark on this point would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper accordingly.


Referee: [§3] §3 (asymptotic log-optimality theorem): the claim that the multiplicative correction term vanishes with mixing time is central to both the optimality and practical validity statements, yet the manuscript supplies no explicit quantitative bound relating the term's size to chain length, dimension, initialization distribution, or sample size. Without such a rate, it is impossible to determine whether the correction becomes negligible before the asymptotic regime is reached in realistic MCMC settings.

Authors: We acknowledge that an explicit rate would be desirable for practical guidance. However, since MCMC mixing times are highly problem-dependent and our result is stated in terms of the general mixing time, providing a universal quantitative bound is not feasible without further assumptions on the chain. In the revised manuscript, we have expanded the discussion in §3 to include practical guidance on assessing mixing via standard diagnostics and noted that the term vanishes in the limit of perfect mixing.
Revision: partial
Referee: [§5] §5 (simulation studies): the reported experiments contain no error bars, confidence intervals, or sensitivity checks to starting distribution and autocorrelation length. Because the exchangeability argument (and therefore finite-sample validity) rests on sufficient mixing, the absence of these diagnostics leaves the empirical support for the theory incomplete.

Authors: We agree with this observation. The revised manuscript now includes error bars computed from repeated simulations, confidence intervals for key performance metrics, and additional experiments varying the initialization and chain lengths to illustrate the impact of mixing on the results.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation of Besag-Clifford e-values


The paper constructs Besag-Clifford e-values from parallel MCMC chains to achieve exchangeability under the null, then derives asymptotic log-optimality of the unnormalized likelihood ratio version up to a mixing-time factor. This chain relies on established prior results for e-values and MCMC exchangeability rather than reducing any central claim to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equation or step equates the optimality result to its inputs by construction; the mixing correction is stated as a theoretical vanishing term, not a fitted quantity. The derivation remains self-contained against external benchmarks in e-value theory and MCMC literature.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard MCMC domain assumptions rather than new free parameters or invented entities.

axioms (2)

domain assumption: Parallel MCMC chains initialized under the null produce samples exchangeable with the observed data. Invoked to guarantee validity of the e-values for any number of iterations.
domain assumption: The Markov chains have finite mixing time. Used to bound the multiplicative gap in the log-optimality result.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We propose to use the parallel method of Besag and Clifford to generate samples that are exchangeable with the data under the null, to then generate valid e-values for any number of iterations or algorithmic steps.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.