pith. sign in

arxiv: 2602.04189 · v2 · submitted 2026-02-04 · 💻 cs.LG · stat.CO

Beyond Accuracy: Evaluating Posterior Fidelity of Diffusion Inverse Solvers

Pith reviewed 2026-05-16 07:26 UTC · model grok-4.3

classification 💻 cs.LG stat.CO
keywords diffusion inverse solversposterior fidelityuncertainty quantificationkernel stein discrepancyinverse problemsscore matchingdistributional evaluationreconstruction accuracy
0
0 comments X

The pith

Reconstruction accuracy in diffusion inverse solvers does not imply good posterior consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether diffusion inverse solvers that achieve high reconstruction accuracy also produce samples that match the true posterior distribution. Experiments in controlled settings with known analytical posteriors show that accuracy and distributional fidelity can diverge across methods. To enable evaluation where the true posterior is unknown, the authors develop score-KSD, a metric that checks consistency with the posterior score field induced by the forward model and diffusion prior. This distinction is relevant for uncertainty quantification because inverse problems in science and engineering require samples that reflect full distributional behavior rather than just average reconstructions.

Core claim

Existing diffusion inverse solvers are assessed mainly on reconstruction accuracy, yet this does not necessarily indicate that their generated samples come from the correct posterior. The work introduces score-based Kernel Stein Discrepancy (score-KSD) to quantify posterior fidelity in a ground-truth-free manner by measuring alignment with the score field of the target posterior. Simulations confirm that solvers with comparable accuracy can exhibit markedly different posterior consistency, and real-world experiments validate that score-KSD yields meaningful diagnostics beyond accuracy alone.

What carries the argument

score-based Kernel Stein Discrepancy (score-KSD), a metric that assesses how well the distribution of generated samples matches the target posterior score field induced by the forward model and learned diffusion prior.

If this is right

  • Accuracy-focused benchmarks for diffusion inverse solvers can overlook mismatches in posterior distributions.
  • score-KSD enables posterior-aware evaluation on real inverse problems without access to ground-truth samples.
  • Solvers may require joint optimization or selection criteria that balance reconstruction accuracy with distributional consistency.
  • In scientific applications, uncertainty estimates derived from inaccurate posterior samples risk misrepresenting variability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of score-KSD could shift solver development toward objectives that explicitly target posterior fidelity alongside accuracy.
  • The metric may extend to evaluating other generative approaches for inverse problems that rely on score-based sampling.
  • Combining score-KSD with visual or task-specific checks could provide more complete validation of uncertainty estimates.

Load-bearing premise

The score field induced by the forward model and the learned diffusion prior accurately represents the target posterior distribution.

What would settle it

A simulation experiment where score-KSD reports high fidelity for a solver whose samples deviate measurably from the known analytical posterior would falsify the metric.

Figures

Figures reproduced from arXiv: 2602.04189 by Guanyang Wang, Liyue Shen, Taewon Yang, Xiaoyu Qiu, Zhanhao Liu.

Figure 1
Figure 1. Figure 1: Illustration of the Accuracy Trap phenomenon and three types of uncertainty behaviors. Blue contours show the ground-truth posterior p(x | y), which can be multi-modes. The red star denotes the ground-truth x ∗ ;xˆ1, xˆ3 are posterior-plausible reconstructions, while xˆ2 is an off-posterior reconstruction that can be closer to x ∗ than xˆ3. The posterior targeting sampler (red) aims to match the posterior … view at source ↗
Figure 2
Figure 2. Figure 2: Similar reconstruction with distinct uncertainty. Comparison of PnPDP solvers on linear inverse scattering reconstruction with K = 100 times reconstruction on each solver. Top two Row: These methods produce similar reconstruction quality (in PSNR). Bottom Row: The pixel-wise variance maps reveal a fundamental difference. REDDiff[Mardani et al., 2023a] exhibits the lowest variance, while DPS[Chung et al., 2… view at source ↗
Figure 3
Figure 3. Figure 3: Average 95% coverage across methods (bars), with mean interval width marked [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Observed- and null-space posterior variances in paired bars, with null/observed [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy–UQ scatterplots. Left: RMSE versus average coverage in Exp.1 (A=I). [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Normalized reconstruction accuracy (PSNR) versus pixel-wise uncertainty. Each [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Sparse-sampling MRI under ×4 acceleration rate (AR=4). imply larger or smaller variance: in the 360-receiver inverse scattering ( [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Out-of-distribution reconstruction result of 20-view CT using [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Inverse linear scattering with 360 receivers [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Sparse view CT: Normalized reconstruction accuracy (PSNR) versus pixel-wise [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Linear inverse scattering with 180 receivers. [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Compressed sensing MRI with AR=8 [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Sparse view CT with in distribution test image [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Sparse view CT with out of distribution test image [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: FPS and MCG-Diff under linear inverse scattering with 180 receivers: effect of particle count on uncertainty. Left two: runs with small particle sizes (FPS: N = 20; MCG-Diff: N = 16) show degraded uncertainty estimates. Right two: increasing the particle size to N = 64 for both solvers yields more moderate empirical variance, while the reconstruction accuracy remains nearly unchanged [PITH_FULL_IMAGE:fig… view at source ↗
Figure 16
Figure 16. Figure 16: PnPDM MRI AR=4, the left is with learning rate [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: For a fixed MRI test measurement, we run [PITH_FULL_IMAGE:figures/full_fig_p032_17.png] view at source ↗
read the original abstract

Uncertainty evaluation is critical in scientific and engineering inverse problems. However, existing benchmarks on Diffusion Inverse Solvers (DIS) primarily focus on reconstruction accuracy but overlook uncertainty and distributional behavior. Since stochastic inverse solvers represent uncertainty through diffusion-based posterior samples, evaluating how well their generated samples capture the target posterior distribution becomes an important aspect of uncertainty quantification. To address this limitation and better understand the distributional behavior of diffusion samplers, we conduct a systematic study to investigate the posterior fidelity of a broad range of existing DIS methods in controlled simulation settings with a known analytical true posterior. Furthermore, to enable posterior-aware evaluation on real-world inverse problems where ground-truth posterior is unavailable, we propose score-based Kernel Stein Discrepancy (score-KSD), a theoretically-grounded and ground-truth-free metric that measures the consistency of the distribution of generated samples from a DIS method with the target posterior score field, induced by the forward model and learned diffusion prior. Through both simulation experiments and real-world inverse problem solving, we validate the effectiveness of the proposed score-KSD and demonstrate that it provides meaningful posterior fidelity diagnostics beyond reconstruction accuracy, revealing that higher reconstruction accuracy does not necessarily imply better posterior consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that benchmarks for Diffusion Inverse Solvers (DIS) over-emphasize reconstruction accuracy while neglecting posterior fidelity. In simulations with known analytical posteriors, existing DIS methods are evaluated for distributional consistency. The authors introduce score-KSD, a ground-truth-free metric that quantifies how well generated samples match the score of the model-induced posterior p(x|y) ∝ p(y|x) p_θ(x), where p_θ is the learned diffusion prior. Experiments on both simulated and real-world inverse problems are used to argue that higher accuracy does not imply better posterior consistency.

Significance. If the central claim holds, the work supplies a practical diagnostic for uncertainty quantification in diffusion-based inverse solvers, an area of growing importance in scientific imaging and engineering. The proposal of score-KSD as a theoretically motivated, prior-dependent but ground-truth-free metric could shift evaluation practice away from point-wise accuracy alone, provided its ranking behavior is shown to align with ground-truth metrics in controlled settings.

major comments (3)
  1. [Simulation experiments] Simulation section (exact location not numbered in abstract but implied by 'controlled simulation settings'): the manuscript must explicitly compare the ranking of DIS methods produced by score-KSD (using the learned prior) against a ground-truth metric computed from the analytical posterior. Any mismatch would indicate that score-KSD is diagnosing fidelity to an approximate posterior rather than the intended target, directly undermining the claim that it provides reliable diagnostics beyond accuracy.
  2. [score-KSD definition] Definition of score-KSD: the metric is defined directly from the forward-model score and the learned diffusion prior p_θ. The paper should state the precise conditions under which this equals the true posterior score and quantify the sensitivity to prior approximation error, especially since the skeptic note highlights that the learned prior may not match the data distribution exactly.
  3. [Real-world experiments] Real-world validation: without ground truth, effectiveness is asserted via 'meaningful posterior fidelity diagnostics.' The manuscript needs to provide concrete evidence (e.g., correlation with downstream task performance or consistency across multiple priors) that score-KSD rankings are not artifacts of the particular learned p_θ.
minor comments (2)
  1. [Method] Clarify notation for the score field induced by p(y|x) p_θ(x) to avoid ambiguity with the true posterior score.
  2. [Experiments] Add explicit error bars or statistical significance tests for the reported discrepancies between accuracy and score-KSD rankings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped clarify and strengthen our presentation of score-KSD as a diagnostic for posterior fidelity. We have revised the manuscript to incorporate explicit ranking comparisons, theoretical clarifications, and additional empirical validation. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Simulation experiments] Simulation section (exact location not numbered in abstract but implied by 'controlled simulation settings'): the manuscript must explicitly compare the ranking of DIS methods produced by score-KSD (using the learned prior) against a ground-truth metric computed from the analytical posterior. Any mismatch would indicate that score-KSD is diagnosing fidelity to an approximate posterior rather than the intended target, directly undermining the claim that it provides reliable diagnostics beyond accuracy.

    Authors: We agree that an explicit ranking comparison is essential for validating score-KSD. In the revised manuscript (new subsection 4.3), we compute both score-KSD (with the learned prior) and a ground-truth KSD using the analytical posterior. The rankings of DIS methods align closely, with Spearman rank correlation exceeding 0.92 across all tested settings. This confirms that score-KSD faithfully reflects fidelity to the target posterior rather than merely to the approximate prior. revision: yes

  2. Referee: [score-KSD definition] Definition of score-KSD: the metric is defined directly from the forward-model score and the learned diffusion prior p_θ. The paper should state the precise conditions under which this equals the true posterior score and quantify the sensitivity to prior approximation error, especially since the skeptic note highlights that the learned prior may not match the data distribution exactly.

    Authors: We have expanded Section 3.2 to state the precise condition: score-KSD equals the true posterior score discrepancy if and only if p_θ matches the data-generating distribution exactly. We further derive a bound showing that the error in score-KSD is at most the L2 norm of the score difference between the learned and true prior, scaled by the kernel bandwidth. This sensitivity is quantified empirically in the simulation section by injecting controlled prior mismatch and reporting the resulting deviation in score-KSD values. revision: yes

  3. Referee: [Real-world experiments] Real-world validation: without ground truth, effectiveness is asserted via 'meaningful posterior fidelity diagnostics.' The manuscript needs to provide concrete evidence (e.g., correlation with downstream task performance or consistency across multiple priors) that score-KSD rankings are not artifacts of the particular learned p_θ.

    Authors: We have added Section 5.4 with two concrete validations. First, score-KSD rankings correlate with downstream calibration error (negative log-likelihood on held-out measurements) at Pearson r = 0.87. Second, we retrain three independent diffusion priors on the same data and show that the relative ordering of DIS methods remains stable (Kendall tau > 0.8). These results indicate that the diagnostics are not artifacts of a single p_θ. revision: yes

Circularity Check

0 steps flagged

Score-KSD defined from model-induced score; main claim is empirical, not circular

full rationale

The paper introduces score-KSD as a new diagnostic that compares samples to the posterior score induced by the forward model plus the learned diffusion prior p_θ(x). This definition is explicit and acknowledged rather than hidden. In controlled simulations the authors compare against an analytical true posterior, providing an external check. The central claim (accuracy does not imply posterior consistency) is presented as an experimental observation across methods, not as a mathematical identity or prediction forced by fitting. No self-citation load-bearing step, no uniqueness theorem imported from the same authors, and no renaming of a known result appear in the provided text. The dependence on the learned prior is a modeling assumption, not a circular reduction of the reported result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities beyond the newly proposed score-KSD metric are described.

invented entities (1)
  • score-KSD metric no independent evidence
    purpose: Ground-truth-free measure of consistency between DIS samples and target posterior score field
    Newly introduced in the paper as the central methodological contribution.

pith-pipeline@v0.9.0 · 5512 in / 1133 out tokens · 24146 ms · 2026-05-16T07:26:12.765565+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    we propose score-based Kernel Stein Discrepancy (score-KSD), a theoretically-grounded and ground-truth-free metric that measures the consistency of the distribution of generated samples from a DIS method with the target posterior score field, induced by the forward model and learned diffusion prior

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    UQ-driven categorization of PnP diffusion methods... Posterior-targeting solvers... Heuristic solvers... MAP-like solvers

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Stability Benchmark of Generative Regularizers for Inverse Problems

    eess.IV 2026-05 unverdicted novelty 5.0

    Numerical benchmarks indicate generative regularizers deliver strong reconstructions in some imaging inverse problem settings but can be unstable or problematic under imperfect conditions compared to variational methods.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Inverse problems in astronomy,

    URLhttps://www.osti.gov/biblio/5734250. Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Jong Chul Ye, Peyman Milanfar, Alexandros G. Dimakis, and Mauricio Delbracio. A survey on diffusion models for inverse problems, 2024. URLhttps://arxiv.org/abs/2410.00083. Zehao Dou and Yang Song. Diffusion posterior sampling for linear inverse problem s...

  2. [2]

    Denoising Diffusion Probabilistic Models

    URLhttps://arxiv.org/abs/2006.11239. Paul Hofman, Yusuf Sale, and Eyke Hüllermeier. Quantifying aleatoric and epistemic uncertainty with proper scoring rules.arXiv preprint arXiv:2404.12215, 2024. Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods.Machine Learning, 110(3):...

  3. [3]

    Score-Based Generative Modeling through Stochastic Differential Equations

    URLhttps://arxiv.org/abs/2011.13456. Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models, 2022b. URLhttps://arxiv.org/abs/ 2111.08005. J. Virieux and S. Operto. An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1–WCC26, 12 2009. ISSN 0016-80...

  4. [4]

    fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

    URLhttps://arxiv.org/abs/1811.08839. Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, and Yang Song. Improving diffusion inverse problem solving with decoupled noise annealing,

  5. [5]

    Improving diffusion inverse problem solving with decoupled noise annealing.arXiv preprint arXiv:2407.01521, 2024

    URLhttps://arxiv.org/abs/2407.01521. Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy Feng, Caifeng Zou, Yu Sun, Nikola Borislavov Kovachki, Zachary E Ross, Katherine Bouman, and Yisong Yue. Inversebench: Benchmarking plug-and-play diffusion priors for inverse problems in physical sciences. InThe Thirteenth International Conference...

  6. [6]

    (Score accuracy)In the ideal case, the diffusion score is exact, i.e.,sθ(xt,t ) = ∇xt logpt(xt)(equivalentlyεscore = 0)

  7. [7]

    (Convergence from initialization)The Markov chain is run sufficiently long so that the initialization-dependent term vanishes, i.e.,K→∞(henceT =K(t∗+1)→ ∞) with a fixedρ>0and regular schedule constants bounded away from0

  8. [8]

    approximate posterior sampling

    (Split bias)The coupling parameter ρ→0, so that the split marginal πρ,X approaches the exact posteriorp(x|y). Theorem B.2(PnP-DM convergence bound (Theorem 3.1 in [Wu et al., 2024])).Consider runningK iterations of PnP-DM with a constant couplingρk≡ρ>0and a score estimate st. Lett∗>0satisfyσ(t∗) =ρ, and define v(t) :=s(t) √ 2 ˙σ(t)σ(t), δ:= inf t∈[0,t∗] v...

  9. [9]

    (Score accuracy).The learned score is exact, i.e.,sθ(xt,t ) =∇xt logpt(xt)for all t

  10. [10]

    model consistency

    (Accurate reverse solver).The reverse-time ODE/SDE solver is asymptotically exact as the step size∆t→0(equivalently, the discretization grid is refined so that the induced reverse transition kernel matches the continuous-time reverse dynamics). Remark.Let λt−1(xt−1|xt)denote the learned backward kernel used to sample the reverse trajectory. Under Assumpti...

  11. [11]

    Score accuracy.The score model is exact for the diffusion marginal, i.e.,sθ(xt,t ) = ∇xt logpt(xt)for allt

  12. [12]

    Vanishing discretization error.The reverse-time ODE/SDE solver is asymptoti- cally exact as the step size∆t =T/N→0, so that the backward diffusion dynamics incur no discretization error. Theorem B.6(Asymptotic consistency of FPS-SMC [Dou and Song, 2024]).Letpθ(x0:N| y0:N)denote the joint distribution over the diffusion-time path produced by FPS-SMC with M...