
arxiv: 2604.22948 · v2 · pith:RMCEY4XHnew · submitted 2026-04-24 · 💻 cs.LG · stat.CO · stat.ML

Score-Repellent Monte Carlo: Toward Efficient Non-Markovian Sampler with Constant Memory in General State Spaces

Pith reviewed 2026-05-08 12:06 UTC · model grok-4.3

classification 💻 cs.LG · stat.CO · stat.ML
keywords Monte Carlo methods · non-Markovian sampling · score functions · stochastic approximation · variance reduction · Markov chain Monte Carlo · history-dependent sampling · general state spaces

The pith

Score-Repellent Monte Carlo uses constant-memory history summaries in general state spaces and, in identified regimes, reduces asymptotic sampling variance at rate O(1/α).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for Monte Carlo sampling that incorporates a limited form of memory to avoid revisiting the same regions repeatedly. Instead of storing full history, it maintains a running average of score function evaluations and uses this to create a time-varying surrogate target distribution via an exponential tilt. This surrogate can be targeted by any standard base sampler, with the history updated online at each step. Theoretical analysis establishes convergence of the process and shows that stronger repulsion parameters can lead to lower estimator variance in certain regimes. The approach extends efficient history-dependent sampling techniques to continuous and high-dimensional spaces while keeping memory costs low.

Core claim

Score-Repellent Monte Carlo (SRMC) summarizes trajectory history by a running average of score evaluations, which is converted into a surrogate target through an exponential score tilt controlled by the parameter α. Any base kernel targeting the original distribution can be applied to the current surrogate, while the history is updated online. Using stochastic approximation with controlled Markovian noise, the coupled system is shown to converge almost surely and to satisfy a joint central limit theorem. In identified regimes, the asymptotic covariance of the Monte Carlo estimators decreases with increasing α at a rate of O(1/α), extending near-zero-variance properties from finite-state cases to general state spaces.
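The wrapper structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the exact tilt is not given on this page, so the sketch assumes a surrogate of the form π_θ(x) ∝ π(x)·exp(α θᵀx), whose score is ∇log π(x) + αθ. The base kernel is an unadjusted Langevin step, and the function name `srmc_ula`, the step size, and the standard-normal target are all hypothetical choices.

```python
import numpy as np

def srmc_ula(score, x0, alpha, step, n_steps, rng):
    """Sketch of a score-repellent wrapper around an unadjusted Langevin step.

    ASSUMPTION: the surrogate is pi_theta(x) ∝ pi(x) * exp(alpha * theta @ x),
    so its score is score(x) + alpha * theta. The paper's tilt may differ.
    """
    x = np.asarray(x0, dtype=float).copy()
    theta = np.zeros_like(x)            # running average of scores: O(d) memory
    samples = np.empty((n_steps, x.size))
    for n in range(n_steps):
        # Base kernel (here: one Langevin step) run on the current surrogate.
        drift = score(x) + alpha * theta
        x = x + step * drift + np.sqrt(2.0 * step) * rng.standard_normal(x.size)
        # Online history update: theta_{n+1} = theta_n + (s(x) - theta_n) / (n + 1).
        theta += (score(x) - theta) / (n + 1)
        samples[n] = x
    return samples, theta

# Toy usage: standard normal target in 2-D, whose score is -x.
rng = np.random.default_rng(0)
samples, theta = srmc_ula(lambda x: -x, np.zeros(2), alpha=2.0,
                          step=0.05, n_steps=20000, rng=rng)
print("history summary norm:", np.linalg.norm(theta))
print("sample mean:", samples.mean(axis=0))
```

Only `x` and `theta` persist between iterations, which is the constant-memory property; the `samples` array is kept here purely for inspection.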

What carries the argument

The exponential score tilt applied to a running average of score evaluations, which generates a normalization-free surrogate target distribution that repels the sampler from previously visited regions.
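To see why a normalization-free surrogate is enough for a standard MCMC kernel, the sketch below runs random-walk Metropolis on a tilted target and checks that shifting the log-density by a constant leaves the chain unchanged. The tilt form log π(x) + αθx, the two-mode target, and the names `mh_step` and `run_chain` are assumptions for illustration; θ is held fixed here to isolate the acceptance step.

```python
import numpy as np

def log_target_unnorm(x):
    # Illustrative two-mode 1-D target, known only up to an additive constant.
    return np.logaddexp(-0.5 * (x + 3.0) ** 2, -0.5 * (x - 3.0) ** 2)

def mh_step(log_p, x, theta, alpha, prop_std, rng):
    """One random-walk Metropolis step on the tilted surrogate."""
    def log_surrogate(z):
        # ASSUMPTION: exponential tilt of the form log pi(z) + alpha * theta * z.
        return log_p(z) + alpha * theta * z
    y = x + prop_std * rng.standard_normal()
    # Only a difference of unnormalized log-densities enters the test,
    # so the log normalizer of the surrogate cancels.
    log_ratio = log_surrogate(y) - log_surrogate(x)
    return y if np.log(rng.uniform()) < log_ratio else x

def run_chain(log_p, theta, alpha, n_steps, seed):
    rng = np.random.default_rng(seed)
    x, path = 0.0, []
    for _ in range(n_steps):
        x = mh_step(log_p, x, theta, alpha, prop_std=1.0, rng=rng)
        path.append(x)
    return np.array(path)

# Same seed, same target up to an arbitrary constant offset.
path_a = run_chain(log_target_unnorm, theta=0.5, alpha=1.0, n_steps=500, seed=1)
path_b = run_chain(lambda z: log_target_unnorm(z) + 123.0,
                   theta=0.5, alpha=1.0, n_steps=500, seed=1)
print("chains agree:", np.allclose(path_a, path_b))
```

In a full sampler θ would be refreshed after every step from the score at the new state; the acceptance computation itself would not change.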

Load-bearing premise

The assumptions required for the stochastic approximation analysis with controlled Markovian noise hold for the chosen base kernel and target distribution, allowing the coupled history recursion and estimators to converge as claimed.

What would settle it

Simulations that track the asymptotic variance of estimators while varying α and checking whether it scales as O(1/α) in the regimes claimed; observing no such decrease or an increase instead would falsify the variance reduction claim.
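A toy version of such a check might look like the following. It is a sketch under stated assumptions, not the paper's experimental protocol: the target is a 1-D standard normal, the base kernel is an unadjusted Langevin step, the tilt is assumed to add αθ to the score, and the estimator is the running mean of the state. The function name `scaled_variance` and all numerical settings are hypothetical.

```python
import numpy as np

def scaled_variance(alpha, n_steps=5000, n_reps=200, step=0.1, seed=0):
    """Estimate n_steps * Var(running mean) across independent replicate chains.

    ASSUMPTIONS: 1-D standard normal target (score -x), Langevin base step,
    surrogate score = score(x) + alpha * theta.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_reps)        # one chain per replicate
    theta = np.zeros(n_reps)               # per-chain running average of scores
    running_mean = np.zeros(n_reps)        # per-chain Monte Carlo estimate of E[X]
    for n in range(n_steps):
        noise = np.sqrt(2.0 * step) * rng.standard_normal(n_reps)
        x = x + step * (-x + alpha * theta) + noise
        theta += (-x - theta) / (n + 1)
        running_mean += (x - running_mean) / (n + 1)
    # Variance across replicates, rescaled by chain length.
    return n_steps * running_mean.var()

results = {alpha: scaled_variance(alpha) for alpha in (0.0, 2.0, 8.0)}
for alpha, value in results.items():
    print(f"alpha={alpha:4.1f}  scaled variance ~ {value:.3f}")
```

If the claimed regime applies to this toy setting, the printed values should fall as α grows; a flat or rising trend on the targets the paper actually studies would be the kind of evidence that counts against the claim.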

Figures

Figures reproduced from arXiv: 2604.22948 by Bohyung Han, Do Young Eun, Geeho Kim, Jie Hu, Jinyoung Choi, Lingyun Chen.

Figure 1: Score-repellent adaptation reshapes the score field to escape a metastable trap. We consider a two-dimensional, two-mode target distribution π (a Gaussian mixture with an imbalanced structure: a dominant narrow mode forming a deep energy well on the left, and a broader mode on the right). Background color shows the (unnormalized) log-density landscape (darker indicates higher density / lower energy). Each …
Figure 2: Follow-up: MALA/HMC on Configured Targets
Figure 3: Mode mixing (solid) and diversity (dashed) evaluation on Static MNIST.
… number of parallel chains needed to obtain useful mode diversity in practice. 4.3. Discrete Energy-based Models. We assess SRMC on the Static MNIST dataset using a discrete energy-based model (EBM) on {0, 1}^784. For a fair and standardized comparison, our implementation is built upon the official GWG codebase (Grathwohl et al., 2021), …
Figure 4: Tuned-baseline ρ-sensitivity on the correlated Gaussian target. The main practical competition is between ρ = 0.6 and ρ = 0.8, while ρ = 1.0 remains weakest over the tested horizons.
… and then freeze at the realized value α̂ = α_{K_w}. In the 100k comparisons, the intended frozen working values are therefore approximately 0.8, 1.6, and 4.0 for α_ref = 1, 2, 5, respectively. Our second rule adds an exponent-scal…
Figure 5: Tuned-baseline ϵ-sensitivity on the Bayesian logistic target at α = 1. Very small ϵ values are clearly harmful for MALA, whereas ϵ = 0.1, 1, and ϵ = α are all stable.
… with continuous Gaussian noise. D.2.1. Experiment 1: Gaussian Mixture Validation. Objective. To validate the mode exploration capability of SRMC in a controlled setting with known ground truth, we construct a synthetic Gaussian mixture benchma…
Figure 6: Fixed-α versus capped adaptive-α screening at 10k matched steps for nominal values {0.5, 1, 2, 3, 5}. The clearest practical gain appears in the aggressive Bayesian-logistic regime, especially for HMC, where adaptive warmup prevents large transient over-tilting.
Mode Assignment. At each step, each sample is assigned to the nearest mode center (argmin of Euclidean distance to the 1,000 μ_k). We track the cum…
Figure 7: Exploration efficiency on Gaussian Mixture (1,000 modes). (a) SR-ULA achieves complete coverage in ∼1,035 steps, while ULA plateaus at 2.8%. (b) ULA clusters near indices 400–500, whereas SR-ULA uniformly traverses the landscape.
… EBM from Du & Mordatch (2019). This model implicitly captures the distribution over CIFAR-10 images (50,000 training images, uniform across 10 classes). Setup. A single chain run…
Figure 8: Single-chain trajectory analysis (2,000 steps). Left: Class distribution over trajectory. ULA collapses to airplane and horse; SR-ULA covers 7 classes. Center: Mode coverage metrics (lower is better). Right: Cumulative unique classes discovered over time.
… in a 62.60% reduction in KL divergence (0.558 vs. 1.492) and 38.97% lower TV distance. Furthermore, SR-ULA discovers modes more rapidly, reaching 8 uniqu…
Figure 9: Multi-chain evaluation (10 chains × 450 steps). Left: Class distribution of final samples. SR-ULA covers more classes and is closer to uniform. Right: Mode coverage metrics (lower is better).
D.3. Static MNIST: Qualitative Trajectories and AIS-Based Diversity. In this section, the state is a binarized MNIST image x ∈ {0, 1}^784 (i.e., 28 × 28 pixels) in a discrete configuration state space with 2^784 possib…
Figure 10: Static MNIST mode mixing (M = 100 chains, T = 10,000 steps, initialized at digit ‘7’). Each row shows 10 randomly selected chains at checkpoints n ∈ {0, 2500, 5000, 7500, 10000}. Top: Baseline GWG trajectories remain trapped near the initialization mode, with most samples resembling ‘7’ or visually similar digits. Bottom: SR-GWG trajectories exhibit faster escape from the initial mode and broader coverage…
Original abstract

History-dependent sampling can reduce long-run Monte Carlo variance by discouraging redundant revisits, but existing schemes typically encode history through an empirical measure on finite state spaces, which is infeasible in high-dimensional discrete configuration spaces and ill-posed in continuous domains. We propose the Score-Repellent Monte Carlo (SRMC) framework, which summarizes trajectory history by a running average of score evaluations in $\mathbb{R}^d$, where $d$ is the dimension of the score and state representation. This history is converted into a surrogate target through an exponential score tilt indexed by $\alpha$, which controls the strength of the history-based repulsion. The surrogate family is normalization-free in the standard MCMC sense, yielding a generic wrapper: at each iteration, any base kernel targeting $\pi$ can instead be run on the current surrogate $\pi_{\theta_n}$ while the history is updated online. We analyze the coupled evolution of the history recursion and Monte Carlo estimators using stochastic approximation with controlled Markovian noise, establishing almost sure convergence and a joint central limit theorem. We further identify regimes in which the asymptotic covariance decreases as $\alpha$ increases, with scaling $O(1/\alpha)$, extending the near-zero-variance effect of finite-state history-dependent samplers to general state spaces with constant memory. Experiments on continuous targets and discrete energy-based models demonstrate improved estimator variance and mode coverage, while retaining $O(d)$ memory usage and modest per-iteration overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Score-Repellent Monte Carlo (SRMC), a constant-memory non-Markovian sampler for general state spaces. History is summarized by a running average θ_n of score evaluations; this is used to form a normalization-free surrogate target π_θn via an exponential tilt of strength α. Any base kernel targeting the original π is instead applied to the current surrogate, with θ_n updated online. The coupled (θ_n, Monte Carlo estimator) recursion is analyzed via stochastic approximation with controlled Markovian noise, yielding almost-sure convergence and a joint central limit theorem. Regimes are identified in which the asymptotic covariance of the estimator decays as O(1/α). Experiments on continuous targets and discrete energy-based models report improved estimator variance and mode coverage while retaining O(d) memory.

Significance. If the stochastic-approximation analysis and the O(1/α) covariance scaling hold under the stated conditions, the work supplies a concrete mechanism for extending finite-state history-dependent variance reduction to general (including continuous) spaces with fixed memory cost. The joint CLT for the coupled recursion and the normalization-free surrogate construction are technically useful; the former enables rigorous asymptotic analysis of the non-Markovian scheme, while the latter permits any existing base kernel to be wrapped without redesign.

major comments (2)
  1. [Stochastic approximation analysis] The O(1/α) decay of asymptotic covariance (abstract and analysis section) is obtained by showing that the α-dependent tilt perturbation vanishes at the required rate inside the controlled-noise CLT. This step presupposes that the base kernel satisfies uniform ergodicity and moment bounds on the score that are independent of both α and the current θ_n. These conditions are not automatically inherited from a generic kernel targeting π and may fail when the score is unbounded or the target has heavy tails; explicit verification (or additional assumptions) for the kernels used in the experiments is therefore load-bearing for the central scaling claim.
  2. [Analysis of coupled recursion] The joint central limit theorem for the coupled recursion (analysis section) relies on the controlled Markovian noise framework applying to the history update and the Monte Carlo estimator. Without a self-contained statement of the precise ergodicity and moment conditions that guarantee the CLT (or a reference to a theorem that directly covers the α-dependent surrogate), it is difficult to confirm that the O(1/α) regime is attained for the targets and kernels considered.
minor comments (2)
  1. [Abstract] The abstract states that the surrogate family is 'normalization-free in the standard MCMC sense'; a brief sentence clarifying that the normalizing constant of π_θn need not be evaluated (because the base kernel is only required to target it up to proportionality) would help readers unfamiliar with the construction.
  2. [Introduction / Notation] Notation for the score dimension versus state dimension is introduced as 'd is the dimension of the score and state representation.' Consistent use of a single symbol (or explicit distinction) throughout the text would avoid ambiguity when memory cost is stated as O(d).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. The recognition of the technical contributions of the normalization-free surrogate and the joint CLT analysis is appreciated. We address each major comment below and indicate the revisions that will be incorporated.

Point-by-point responses
  1. Referee: [Stochastic approximation analysis] The O(1/α) decay of asymptotic covariance (abstract and analysis section) is obtained by showing that the α-dependent tilt perturbation vanishes at the required rate inside the controlled-noise CLT. This step presupposes that the base kernel satisfies uniform ergodicity and moment bounds on the score that are independent of both α and the current θ_n. These conditions are not automatically inherited from a generic kernel targeting π and may fail when the score is unbounded or the target has heavy tails; explicit verification (or additional assumptions) for the kernels used in the experiments is therefore load-bearing for the central scaling claim.

    Authors: We agree that uniform ergodicity and α-independent moment bounds on the score are load-bearing assumptions for the O(1/α) covariance claim. The manuscript states these as part of the controlled Markovian noise framework applied to the base kernel. For the reported experiments, MALA is used on continuous targets satisfying standard smoothness and strong log-concavity conditions that guarantee geometric ergodicity, while the discrete energy-based models employ local Gibbs kernels that are ergodic on finite spaces. We will add a dedicated paragraph in the analysis section that explicitly lists the required kernel conditions, supplies references or brief verification for the experimental kernels, and discusses the additional restrictions needed for heavy-tailed targets. This will make the assumptions transparent without altering the main results. revision: yes

  2. Referee: [Analysis of coupled recursion] The joint central limit theorem for the coupled recursion (analysis section) relies on the controlled Markovian noise framework applying to the history update and the Monte Carlo estimator. Without a self-contained statement of the precise ergodicity and moment conditions that guarantee the CLT (or a reference to a theorem that directly covers the α-dependent surrogate), it is difficult to confirm that the O(1/α) regime is attained for the targets and kernels considered.

    Authors: The joint CLT is obtained by embedding the α-dependent surrogate within the controlled Markovian noise setting, where the perturbation induced by θ_n is shown to vanish at the appropriate rate. We will revise the analysis section to include a self-contained statement of the precise ergodicity and moment conditions (uniform geometric ergodicity of the base kernel together with bounded second moments of the score that are uniform in θ_n over a compact set containing the limit). We will also cite the specific theorem from the controlled-noise literature that is applied and briefly verify that the α-dependent tilt satisfies the required Lipschitz and boundedness conditions. These additions will allow direct confirmation that the O(1/α) regime holds under the stated assumptions for the targets considered. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained stochastic approximation analysis

full rationale

The paper defines SRMC via a running score average θ_n converted to an α-indexed exponential tilt surrogate π_θn, runs any base kernel on the surrogate, and analyzes the coupled (θ_n, estimator) recursion via stochastic approximation with controlled Markovian noise to obtain a.s. convergence and a joint CLT. The O(1/α) asymptotic covariance scaling is identified as a consequence of that CLT under stated regimes and assumptions on the base kernel (uniform ergodicity, moment bounds independent of α and θ_n). No step reduces by the paper's own equations to a fitted parameter renamed as prediction, a self-definitional construct, or a load-bearing self-citation chain; the central claims remain independent of the experimental data used for illustration. The derivation therefore does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the introduction of the score-average history summary and the surrogate family, plus standard stochastic-approximation assumptions for the coupled dynamics; α is a user-chosen parameter rather than a fitted constant.

free parameters (1)
  • α
    User-chosen parameter controlling the magnitude of history-based repulsion in the exponential score tilt.
axioms (2)
  • [domain assumption] Any base kernel targeting the original distribution π can be run on the current surrogate π_θn.
    The wrapper property is stated without further justification in the abstract.
  • [domain assumption] Stochastic approximation with controlled Markovian noise applies to the joint evolution of the history recursion and the Monte Carlo estimators.
    Invoked to obtain almost-sure convergence and the joint CLT.
invented entities (1)
  • surrogate target π_θn [no independent evidence]
    purpose: To encode trajectory history via score tilt for repulsion while remaining normalization-free.
    New construct introduced to enable constant-memory non-Markovian behavior.

