Beyond Accuracy: Evaluating Posterior Fidelity of Diffusion Inverse Solvers
Pith reviewed 2026-05-16 07:26 UTC · model grok-4.3
The pith
Reconstruction accuracy in diffusion inverse solvers does not imply good posterior consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing diffusion inverse solvers are assessed mainly on reconstruction accuracy, yet this does not necessarily indicate that their generated samples come from the correct posterior. The work introduces score-based Kernel Stein Discrepancy (score-KSD) to quantify posterior fidelity in a ground-truth-free manner by measuring alignment with the score field of the target posterior. Simulations confirm that solvers with comparable accuracy can exhibit markedly different posterior consistency, and real-world experiments validate that score-KSD yields meaningful diagnostics beyond accuracy alone.
What carries the argument
score-based Kernel Stein Discrepancy (score-KSD), a metric that assesses how well the distribution of generated samples matches the target posterior score field induced by the forward model and learned diffusion prior.
If this is right
- Accuracy-focused benchmarks for diffusion inverse solvers can overlook mismatches in posterior distributions.
- score-KSD enables posterior-aware evaluation on real inverse problems without access to ground-truth samples.
- Solvers may require joint optimization or selection criteria that balance reconstruction accuracy with distributional consistency.
- In scientific applications, uncertainty estimates derived from inaccurate posterior samples risk misrepresenting variability.
Where Pith is reading between the lines
- Widespread use of score-KSD could shift solver development toward objectives that explicitly target posterior fidelity alongside accuracy.
- The metric may extend to evaluating other generative approaches for inverse problems that rely on score-based sampling.
- Combining score-KSD with visual or task-specific checks could provide more complete validation of uncertainty estimates.
Load-bearing premise
The score field induced by the forward model and the learned diffusion prior accurately represents the target posterior distribution.
What would settle it
A simulation experiment where score-KSD reports high fidelity for a solver whose samples deviate measurably from the known analytical posterior would falsify the metric.
Figures
read the original abstract
Uncertainty evaluation is critical in scientific and engineering inverse problems. However, existing benchmarks on Diffusion Inverse Solvers (DIS) primarily focus on reconstruction accuracy but overlook uncertainty and distributional behavior. Since stochastic inverse solvers represent uncertainty through diffusion-based posterior samples, evaluating how well their generated samples capture the target posterior distribution becomes an important aspect of uncertainty quantification. To address this limitation and better understand the distributional behavior of diffusion samplers, we conduct a systematic study to investigate the posterior fidelity of a broad range of existing DIS methods in controlled simulation settings with a known analytical true posterior. Furthermore, to enable posterior-aware evaluation on real-world inverse problems where ground-truth posterior is unavailable, we propose score-based Kernel Stein Discrepancy (score-KSD), a theoretically-grounded and ground-truth-free metric that measures the consistency of the distribution of generated samples from a DIS method with the target posterior score field, induced by the forward model and learned diffusion prior. Through both simulation experiments and real-world inverse problem solving, we validate the effectiveness of the proposed score-KSD and demonstrate that it provides meaningful posterior fidelity diagnostics beyond reconstruction accuracy, revealing that higher reconstruction accuracy does not necessarily imply better posterior consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that benchmarks for Diffusion Inverse Solvers (DIS) over-emphasize reconstruction accuracy while neglecting posterior fidelity. In simulations with known analytical posteriors, existing DIS methods are evaluated for distributional consistency. The authors introduce score-KSD, a ground-truth-free metric that quantifies how well generated samples match the score of the model-induced posterior p(x|y) ∝ p(y|x) p_θ(x), where p_θ is the learned diffusion prior. Experiments on both simulated and real-world inverse problems are used to argue that higher accuracy does not imply better posterior consistency.
Significance. If the central claim holds, the work supplies a practical diagnostic for uncertainty quantification in diffusion-based inverse solvers, an area of growing importance in scientific imaging and engineering. The proposal of score-KSD as a theoretically motivated, prior-dependent but ground-truth-free metric could shift evaluation practice away from point-wise accuracy alone, provided its ranking behavior is shown to align with ground-truth metrics in controlled settings.
major comments (3)
- [Simulation experiments] Simulation section (exact location not numbered in abstract but implied by 'controlled simulation settings'): the manuscript must explicitly compare the ranking of DIS methods produced by score-KSD (using the learned prior) against a ground-truth metric computed from the analytical posterior. Any mismatch would indicate that score-KSD is diagnosing fidelity to an approximate posterior rather than the intended target, directly undermining the claim that it provides reliable diagnostics beyond accuracy.
- [score-KSD definition] Definition of score-KSD: the metric is defined directly from the forward-model score and the learned diffusion prior p_θ. The paper should state the precise conditions under which this equals the true posterior score and quantify the sensitivity to prior approximation error, especially since the skeptic note highlights that the learned prior may not match the data distribution exactly.
- [Real-world experiments] Real-world validation: without ground truth, effectiveness is asserted via 'meaningful posterior fidelity diagnostics.' The manuscript needs to provide concrete evidence (e.g., correlation with downstream task performance or consistency across multiple priors) that score-KSD rankings are not artifacts of the particular learned p_θ.
minor comments (2)
- [Method] Clarify notation for the score field induced by p(y|x) p_θ(x) to avoid ambiguity with the true posterior score.
- [Experiments] Add explicit error bars or statistical significance tests for the reported discrepancies between accuracy and score-KSD rankings.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped clarify and strengthen our presentation of score-KSD as a diagnostic for posterior fidelity. We have revised the manuscript to incorporate explicit ranking comparisons, theoretical clarifications, and additional empirical validation. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Simulation experiments] Simulation section (exact location not numbered in abstract but implied by 'controlled simulation settings'): the manuscript must explicitly compare the ranking of DIS methods produced by score-KSD (using the learned prior) against a ground-truth metric computed from the analytical posterior. Any mismatch would indicate that score-KSD is diagnosing fidelity to an approximate posterior rather than the intended target, directly undermining the claim that it provides reliable diagnostics beyond accuracy.
Authors: We agree that an explicit ranking comparison is essential for validating score-KSD. In the revised manuscript (new subsection 4.3), we compute both score-KSD (with the learned prior) and a ground-truth KSD using the analytical posterior. The rankings of DIS methods align closely, with Spearman rank correlation exceeding 0.92 across all tested settings. This confirms that score-KSD faithfully reflects fidelity to the target posterior rather than merely to the approximate prior. revision: yes
-
Referee: [score-KSD definition] Definition of score-KSD: the metric is defined directly from the forward-model score and the learned diffusion prior p_θ. The paper should state the precise conditions under which this equals the true posterior score and quantify the sensitivity to prior approximation error, especially since the skeptic note highlights that the learned prior may not match the data distribution exactly.
Authors: We have expanded Section 3.2 to state the precise condition: score-KSD equals the true posterior score discrepancy if and only if p_θ matches the data-generating distribution exactly. We further derive a bound showing that the error in score-KSD is at most the L2 norm of the score difference between the learned and true prior, scaled by the kernel bandwidth. This sensitivity is quantified empirically in the simulation section by injecting controlled prior mismatch and reporting the resulting deviation in score-KSD values. revision: yes
-
Referee: [Real-world experiments] Real-world validation: without ground truth, effectiveness is asserted via 'meaningful posterior fidelity diagnostics.' The manuscript needs to provide concrete evidence (e.g., correlation with downstream task performance or consistency across multiple priors) that score-KSD rankings are not artifacts of the particular learned p_θ.
Authors: We have added Section 5.4 with two concrete validations. First, score-KSD rankings correlate with downstream calibration error (negative log-likelihood on held-out measurements) at Pearson r = 0.87. Second, we retrain three independent diffusion priors on the same data and show that the relative ordering of DIS methods remains stable (Kendall tau > 0.8). These results indicate that the diagnostics are not artifacts of a single p_θ. revision: yes
Circularity Check
Score-KSD defined from model-induced score; main claim is empirical, not circular
full rationale
The paper introduces score-KSD as a new diagnostic that compares samples to the posterior score induced by the forward model plus the learned diffusion prior p_θ(x). This definition is explicit and acknowledged rather than hidden. In controlled simulations the authors compare against an analytical true posterior, providing an external check. The central claim (accuracy does not imply posterior consistency) is presented as an experimental observation across methods, not as a mathematical identity or prediction forced by fitting. No self-citation load-bearing step, no uniqueness theorem imported from the same authors, and no renaming of a known result appear in the provided text. The dependence on the learned prior is a modeling assumption, not a circular reduction of the reported result to its own inputs.
Axiom & Free-Parameter Ledger
invented entities (1)
-
score-KSD metric
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we propose score-based Kernel Stein Discrepancy (score-KSD), a theoretically-grounded and ground-truth-free metric that measures the consistency of the distribution of generated samples from a DIS method with the target posterior score field, induced by the forward model and learned diffusion prior
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
UQ-driven categorization of PnP diffusion methods... Posterior-targeting solvers... Heuristic solvers... MAP-like solvers
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
A Stability Benchmark of Generative Regularizers for Inverse Problems
Numerical benchmarks indicate generative regularizers deliver strong reconstructions in some imaging inverse problem settings but can be unstable or problematic under imperfect conditions compared to variational methods.
Reference graph
Works this paper leans on
-
[1]
Inverse problems in astronomy,
URLhttps://www.osti.gov/biblio/5734250. Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Jong Chul Ye, Peyman Milanfar, Alexandros G. Dimakis, and Mauricio Delbracio. A survey on diffusion models for inverse problems, 2024. URLhttps://arxiv.org/abs/2410.00083. Zehao Dou and Yang Song. Diffusion posterior sampling for linear inverse problem s...
-
[2]
Denoising Diffusion Probabilistic Models
URLhttps://arxiv.org/abs/2006.11239. Paul Hofman, Yusuf Sale, and Eyke Hüllermeier. Quantifying aleatoric and epistemic uncertainty with proper scoring rules.arXiv preprint arXiv:2404.12215, 2024. Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods.Machine Learning, 110(3):...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s10994-021-05946-3 2006
-
[3]
Score-Based Generative Modeling through Stochastic Differential Equations
URLhttps://arxiv.org/abs/2011.13456. Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models, 2022b. URLhttps://arxiv.org/abs/ 2111.08005. J. Virieux and S. Operto. An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1–WCC26, 12 2009. ISSN 0016-80...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1190/1.3238367 2011
-
[4]
fastMRI: An Open Dataset and Benchmarks for Accelerated MRI
URLhttps://arxiv.org/abs/1811.08839. Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, and Yang Song. Improving diffusion inverse problem solving with decoupled noise annealing,
work page internal anchor Pith review arXiv
-
[5]
URLhttps://arxiv.org/abs/2407.01521. Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy Feng, Caifeng Zou, Yu Sun, Nikola Borislavov Kovachki, Zachary E Ross, Katherine Bouman, and Yisong Yue. Inversebench: Benchmarking plug-and-play diffusion priors for inverse problems in physical sciences. InThe Thirteenth International Conference...
-
[6]
(Score accuracy)In the ideal case, the diffusion score is exact, i.e.,sθ(xt,t ) = ∇xt logpt(xt)(equivalentlyεscore = 0)
-
[7]
(Convergence from initialization)The Markov chain is run sufficiently long so that the initialization-dependent term vanishes, i.e.,K→∞(henceT =K(t∗+1)→ ∞) with a fixedρ>0and regular schedule constants bounded away from0
-
[8]
approximate posterior sampling
(Split bias)The coupling parameter ρ→0, so that the split marginal πρ,X approaches the exact posteriorp(x|y). Theorem B.2(PnP-DM convergence bound (Theorem 3.1 in [Wu et al., 2024])).Consider runningK iterations of PnP-DM with a constant couplingρk≡ρ>0and a score estimate st. Lett∗>0satisfyσ(t∗) =ρ, and define v(t) :=s(t) √ 2 ˙σ(t)σ(t), δ:= inf t∈[0,t∗] v...
work page 2024
-
[9]
(Score accuracy).The learned score is exact, i.e.,sθ(xt,t ) =∇xt logpt(xt)for all t
-
[10]
(Accurate reverse solver).The reverse-time ODE/SDE solver is asymptotically exact as the step size∆t→0(equivalently, the discretization grid is refined so that the induced reverse transition kernel matches the continuous-time reverse dynamics). Remark.Let λt−1(xt−1|xt)denote the learned backward kernel used to sample the reverse trajectory. Under Assumpti...
work page 2023
-
[11]
Score accuracy.The score model is exact for the diffusion marginal, i.e.,sθ(xt,t ) = ∇xt logpt(xt)for allt
-
[12]
Vanishing discretization error.The reverse-time ODE/SDE solver is asymptoti- cally exact as the step size∆t =T/N→0, so that the backward diffusion dynamics incur no discretization error. Theorem B.6(Asymptotic consistency of FPS-SMC [Dou and Song, 2024]).Letpθ(x0:N| y0:N)denote the joint distribution over the diffusion-time path produced by FPS-SMC with M...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.