pith. sign in

arxiv: 2508.20614 · v3 · submitted 2025-08-28 · 📊 stat.ML · cs.LG· stat.CO

Improving the Accuracy of Amortized Model Comparison with Self-Consistency

Pith reviewed 2026-05-18 21:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.CO
keywords amortized Bayesian model comparisonself-consistency lossmodel misspecificationneural surrogatessimulation-based inferenceopen-world model comparisondistribution shiftmodel selection
0
0 comments X

The pith

Self-consistency training on unlabeled real data improves amortized Bayesian model comparison accuracy when all candidate models are misspecified.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Amortized Bayesian model comparison trains neural networks on simulated data to rank competing models quickly. The paper demonstrates that supplementing this training with a self-consistency loss on real unlabeled data sharpens the resulting estimates precisely in the open-world setting where every model fails to generate the observed data. Classifier-based approaches already work reasonably when one model is correct but gain the least from the added loss. Strong gains appear instead when analytic likelihoods exist or when surrogate likelihoods track the true posterior locally, even if the overall model is severely wrong. This matters because model comparison delivers the most value exactly when perfect specification is impossible.

Core claim

In the open-world scenario where all models are misspecified, self-consistency training on unlabeled real data strongly improves the accuracy of amortized BMC estimators that rely on analytic likelihoods or on surrogate likelihoods that remain locally accurate near the true parameter posterior, even for severely misspecified models.

What carries the argument

Self-consistency loss applied to unlabeled real data, which regularizes the neural surrogates by enforcing agreement in model probability estimates across different transformations of the same observations.

If this is right

  • Classifier-based BMC methods perform adequately without self-consistency in closed-world settings but improve the least from the added training.
  • Self-consistency yields substantial gains in open-world settings whenever analytic likelihoods or locally accurate surrogates are present.
  • Practical use of amortized BMC should incorporate self-consistency whenever misspecification is suspected.
  • Future extensions could explore self-consistency for other forms of distribution shift in simulation-based inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar consistency losses on real data could improve other amortized inference pipelines that face distribution shifts.
  • The approach suggests hybrid simulation-plus-real training can lessen dependence on perfectly specified generative models.
  • Testing the method on hierarchical or high-dimensional models would clarify how far the local-accuracy condition can stretch.
  • The pattern connects to consistency regularization techniques already used to handle shifts in supervised learning.

Load-bearing premise

Surrogate likelihoods must remain locally accurate near the true parameter posterior or analytic likelihoods must be available for at least some of the models being compared.

What would settle it

An experiment showing no accuracy improvement from self-consistency training in an open-world case where surrogate likelihoods deviate substantially outside the region near the true posterior would falsify the central claim.

Figures

Figures reproduced from arXiv: 2508.20614 by Aayush Mishra, Daniel Habermann, Paul-Christian B\"urkner, Stefan T. Radev, \v{S}imon Kucharsk\'y.

Figure 1
Figure 1. Figure 1: Log marginal likelihood estimates comparing neural estimators ( [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of the estimated log p(y) from NPE+SC and NPE against the gold-standard bridge-sampling results. The dataset from M = 15 countries was used to evaluate the SC loss. 5.2 Experiment 2: Racing diffusion models of decision making Next, we analyze real data from the lexical decision task of Wagenmakers et al. [24] (Experiment 1), where 17 participants judged letter strings as words or non-words under… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of the estimated log p(y) from NPE+SC and NPE against the gold-standard bridge-sampling results. The dataset from M = 4 countries was used to evaluate the SC loss and log p(y) was estimated using 256 Monte Carlo samples. 12 [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of the estimated log p(y) from NPE+SC and NPE against the gold-standard bridge-sampling results. The dataset from M = 8 countries was used to evaluate the SC loss and log p(y) was estimated using 256 Monte Carlo samples. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

Amortized Bayesian model comparison (BMC) enables fast probabilistic ranking of models via simulation-based training of neural surrogates. However, the accuracy of neural surrogates deteriorates when simulation models are misspecified; the very case where model comparison is most needed. We evaluate four different amortized BMC methods. We supplement traditional simulation-based training of these methods with a \emph{self-consistency} (SC) loss on unlabeled real data to improve BMC estimates under distribution shifts. Using one artificial and two real-world case studies, we compare amortized BMC estimators with and without SC against analytic or bridge sampling benchmarks. In the \emph{closed-world} case (data is generated by one of the candidate models), BMC estimators using classifiers work acceptably well even without SC training. However, these methods also benefit the least from SC training. In the \emph{open-world} scenario (all models misspecified), SC training strongly improves BMC estimators when having access to analytic likelihoods, or when surrogate likelihoods are locally accurate near the true parameter posterior, even for severely misspecified models. We conclude with practical recommendations for amortized BMC and suggestions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a self-consistency (SC) loss on unlabeled real data to supplement simulation-based training of four amortized Bayesian model comparison (BMC) methods. It evaluates these with and without SC on one artificial and two real-world case studies, benchmarking against analytic likelihoods or bridge sampling. Key findings: classifiers perform adequately without SC in closed-world settings (data generated by a candidate model) but gain least from SC; in open-world settings (all models misspecified), SC yields strong improvements when analytic likelihoods are available or when surrogate likelihoods remain locally accurate near the true posterior, even under severe misspecification. The work concludes with practical recommendations.

Significance. If the central findings hold after addressing verification gaps, the work would offer a practical enhancement to amortized BMC for the common case of misspecified simulation models, where model comparison is most needed. The explicit conditioning of gains on local surrogate accuracy and the closed- vs. open-world distinction provide useful guidance, and the use of external benchmarks (analytic/bridge sampling) strengthens the evaluation relative to purely internal metrics.

major comments (1)
  1. [real-world case studies / open-world scenario] Real-world case studies (open-world experiments without analytic likelihoods): The claim that SC training strongly improves BMC estimators 'even for severely misspecified models' when surrogate likelihoods are locally accurate near the true parameter posterior is load-bearing for the open-world conclusion, yet no independent diagnostic is reported to confirm this local accuracy condition holds (e.g., no local KL divergence, restricted posterior-predictive calibration, or likelihood-ratio tests in a neighborhood of the inferred posterior). Without such a check, observed gains cannot be confidently attributed to the stated mechanism rather than other factors.
minor comments (2)
  1. [abstract and experiments] The abstract and experiments section would benefit from explicit reporting of error bars, number of runs, and any exclusion criteria for the benchmark comparisons to allow readers to assess variability and robustness.
  2. [methods] Clarify the precise definitions and architectural differences among the four amortized BMC methods early in the methods section, including how the SC loss is integrated with each.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The distinction between closed- and open-world settings and the emphasis on local surrogate accuracy are central to our claims; we address the verification concern below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [real-world case studies / open-world scenario] Real-world case studies (open-world experiments without analytic likelihoods): The claim that SC training strongly improves BMC estimators 'even for severely misspecified models' when surrogate likelihoods are locally accurate near the true parameter posterior is load-bearing for the open-world conclusion, yet no independent diagnostic is reported to confirm this local accuracy condition holds (e.g., no local KL divergence, restricted posterior-predictive calibration, or likelihood-ratio tests in a neighborhood of the inferred posterior). Without such a check, observed gains cannot be confidently attributed to the stated mechanism rather than other factors.

    Authors: We agree that an independent diagnostic would strengthen attribution of the observed SC gains to the local-accuracy mechanism rather than to other factors. In the real-world case studies the surrogate likelihoods were chosen on the basis of domain knowledge and were already subjected to global posterior-predictive checks; however, these checks were not restricted to neighborhoods of the inferred posterior. We will revise the manuscript to (i) report restricted posterior-predictive calibration and local likelihood-ratio diagnostics where computationally feasible, (ii) add an explicit limitations paragraph stating that the local-accuracy assumption remains partly inferential in the absence of analytic likelihoods, and (iii) temper the wording of the open-world conclusion to reflect this conditional support. These additions do not require new simulation experiments but will make the evidential basis for the claim more transparent. revision: partial

Circularity Check

0 steps flagged

No circularity: improvements validated against external analytic and bridge-sampling benchmarks.

full rationale

The paper defines a self-consistency loss added to simulation-based training of neural surrogates for amortized BMC and reports empirical gains in open-world misspecification settings. All reported improvements are measured against independent external references (analytic likelihoods or bridge sampling) rather than quantities defined by the fitted parameters or loss inside the paper. No derivation step equates a prediction to its own input by construction, renames a fitted quantity as a forecast, or relies on a load-bearing self-citation whose validity is presupposed by the present work. The central claims therefore remain self-contained and falsifiable outside the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or ad-hoc axioms are stated in the provided text.

axioms (1)
  • domain assumption Neural surrogates trained on simulations can approximate model posterior probabilities
    Implicit foundation of all amortized BMC methods referenced in the abstract.

pith-pipeline@v0.9.0 · 5755 in / 1208 out tokens · 38781 ms · 2026-05-18T21:12:43.040819+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    Cambridge university press, 2003

    David JC MacKay.Information theory, inference and learning algorithms. Cambridge university press, 2003

  2. [2]

    Bayesian model selection, the marginal likelihood, and generalization

    Sanae Lotfi, Pavel Izmailov, Gregory Benton, Micah Goldblum, and Andrew Gordon Wil- son. Bayesian model selection, the marginal likelihood, and generalization. In International Conference on Machine Learning, pages 14223–14247. PMLR, 2022

  3. [3]

    Amortized bayesian model comparison with evidential deep learning

    Stefan T Radev, Marco D’Alessandro, Ulf K Mertens, Andreas V oss, Ullrich Koethe, and Paul-Christian Buerkner. Amortized bayesian model comparison with evidential deep learning. IEEE Transactions on Neural Networks and Learning Systems, 34(8):4903–4917, 2021

  4. [4]

    JANA: Jointly amortized neural approximation of complex Bayesian models

    Stefan T Radev, Marvin Schmitt, Valentin Pratz, Umberto Picchini, Ullrich Köthe, and Paul- Christian Bürkner. JANA: Jointly amortized neural approximation of complex Bayesian models. In Uncertainty in Artificial Intelligence, pages 1695–1706. PMLR, 2023

  5. [5]

    Generalized out-of-distribution detection: A survey

    Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12):5635–5662, 2024

  6. [6]

    Detecting model misspecification in amortized Bayesian inference with neural networks

    Marvin Schmitt, Paul-Christian Bürkner, Ullrich Köthe, and Stefan T Radev. Detecting model misspecification in amortized Bayesian inference with neural networks. In Dagm german conference on pattern recognition, pages 541–557. Springer, 2023

  7. [7]

    The statistical accuracy of neural posterior and likelihood estimation

    David T Frazier, Ryan Kelly, Christopher Drovandi, and David J Warne. The statistical accuracy of neural posterior and likelihood estimation. arXiv preprint arXiv:2411.12068, 2024

  8. [8]

    Ivanova, Daniel Habermann, Ullrich Köthe, Paul-Christian Bürkner, and Stefan T

    Marvin Schmitt, Desi R. Ivanova, Daniel Habermann, Ullrich Köthe, Paul-Christian Bürkner, and Stefan T. Radev. Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference, July 2024. URL http://arxiv.org/abs/2310.04395. arXiv:2310.04395 [cs]

  9. [9]

    Ivanova, Marvin Schmitt, and Stefan T

    Desi R. Ivanova, Marvin Schmitt, and Stefan T. Radev. Data-Efficient Variational Mutual Infor- mation Estimation via Bayesian Self-Consistency. In NeurIPS BDU Workshop 2024, October

  10. [10]

    URL https://openreview.net/forum?id=QfiyElaO1f&noteId=aRvehpmMkK

  11. [11]

    Radev, and Paul-Christian Bürkner

    Aayush Mishra, Daniel Habermann, Marvin Schmitt, Stefan T. Radev, and Paul-Christian Bürkner. Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data, May 2025. URL http://arxiv.org/abs/2501.13483. arXiv:2501.13483 [stat]. 5

  12. [12]

    Bayes factors

    Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995

  13. [13]

    bridgesampling: An R package for estimating normalizing constants

    Quentin F Gronau, Henrik Singmann, and Eric-Jan Wagenmakers. bridgesampling: An R package for estimating normalizing constants. Journal of Statistical Software, 92:1–29, 2020

  14. [14]

    Marginal likelihood computation for model selection and hypothesis testing: an extensive review

    Fernando Llorente, Luca Martino, David Delgado, and Javier Lopez-Santiago. Marginal likelihood computation for model selection and hypothesis testing: an extensive review. SIAM review, 65(1):3–58, 2023

  15. [15]

    Bayesian evidence and model selection

    Kevin H Knuth, Michael Habeck, Nabin K Malakar, Asim M Mubeen, and Ben Placek. Bayesian evidence and model selection. Digital Signal Processing, 47:50–67, 2015

  16. [16]

    Strictly proper scoring rules, prediction, and estimation

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359–378, 2007

  17. [17]

    https://doi.org/10.48550/arXiv.2311.15650, arXiv:2311.15650

    Konstantin Karchev, Roberto Trotta, and Christoph Weniger. SimSIMS: Simulation-based Su- pernova Ia Model Selection with thousands of latent variables.arXiv preprint arXiv:2311.15650, 2023

  18. [18]

    How to ask twenty questions and win: Machine learning tools for assessing preferences from small samples of willingness-to-pay prices

    Konstantina Sokratous, Anderson K Fitch, and Peter D Kvam. How to ask twenty questions and win: Machine learning tools for assessing preferences from small samples of willingness-to-pay prices. Journal of choice modelling, 48:100418, 2023

  19. [19]

    Validation and comparison of non-stationary cognitive models: A diffusion model application

    Lukas Schumacher, Martin Schnuerch, Andreas V oss, and Stefan T Radev. Validation and comparison of non-stationary cognitive models: A diffusion model application. Computational Brain & Behavior, 8(2):191–210, 2025

  20. [20]

    A deep learning method for comparing Bayesian hierarchical models

    Lasse Elsemüller, Martin Schnuerch, Paul-Christian Bürkner, and Stefan T Radev. A deep learning method for comparing Bayesian hierarchical models. Psychological Methods, 2024

  21. [21]

    Simultaneous identification of models and parameters of scientific simulators

    Cornelius Schröder and Jakob H Macke. Simultaneous identification of models and parameters of scientific simulators. In Proceedings of the 41st International Conference on Machine Learning, pages 43895–43927, 2024

  22. [22]

    Evidence Networks: Simple losses for fast, amortized, neural Bayesian model comparison

    Niall Jeffrey and Benjamin D Wandelt. Evidence Networks: Simple losses for fast, amortized, neural Bayesian model comparison. Machine Learning: Science and Technology, 5(1):015008, 2024

  23. [23]

    Bayesian model comparison for simulation-based inference

    A Spurio Mancini, MM Docherty, MA Price, and JD McEwen. Bayesian model comparison for simulation-based inference. RAS Techniques and Instruments, 2(1):710–722, 2023

  24. [24]

    Bayesian evidence estimation from posterior samples with normalizing flows

    Rahul Srinivasan, Marco Crisostomi, Roberto Trotta, Enrico Barausse, and Matteo Breschi. Bayesian evidence estimation from posterior samples with normalizing flows. Physical Review D, 110(12):123007, 2024

  25. [25]

    A diffusion model account of criterion shifts in the lexical decision task

    Eric-Jan Wagenmakers, Roger Ratcliff, Pablo Gomez, and Gail McKoon. A diffusion model account of criterion shifts in the lexical decision task. Journal of memory and language, 58(1): 140–159, 2008

  26. [26]

    Sequential sampling models without random between-trial variability: The racing diffusion model of speeded decision making

    Gabriel Tillman, Trish Van Zandt, and Gordon D Logan. Sequential sampling models without random between-trial variability: The racing diffusion model of speeded decision making. Psychonomic Bulletin & Review, 27(5):911–936, 2020

  27. [27]

    Stan Reference Manual, 2025

    Stan Development Team. Stan Reference Manual, 2025. URL https://mc-stan.org/. version 2.32.2

  28. [28]

    RStan: the R interface to Stan, 2025

    Stan Development Team. RStan: the R interface to Stan, 2025. URL https://mc-stan.org/. R package version 2.32.7

  29. [29]

    International extra-eu air passenger transport by reporting country and partner world regions and countries, doi:10.2908/avia_paexcc, 2022

    Eurostat. International extra-eu air passenger transport by reporting country and partner world regions and countries, doi:10.2908/avia_paexcc, 2022

  30. [30]

    Household debt, consolidated including Non-profit institutions serving households - % of GDP, doi:10.2908/TIPSD22, 2022

    Eurostat. Household debt, consolidated including Non-profit institutions serving households - % of GDP, doi:10.2908/TIPSD22, 2022. 6

  31. [31]

    Real gdp per capita, doi:10.2908/SDG_08_10, 2022

    Eurostat. Real gdp per capita, doi:10.2908/SDG_08_10, 2022

  32. [32]

    Density estimation using Real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016

  33. [33]

    Mish: A self regularized non-monotonic activation function

    Diganta Misra. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681, 2019

  34. [34]

    Deep sets

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30, 2017

  35. [35]

    Neural spline flows

    Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. Advances in neural information processing systems, 32, 2019

  36. [36]

    Neural importance sampling

    Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. ACM Transactions on Graphics (ToG), 38(5):1–19, 2019. 7 Appendix A The marginal likelihood and self-consistency We start with the Bayes’ theorem for parameter posterior given data, p(θk | y, Mk) = p(θk | Mk) × p(y | θk, Mk) p(y | Mk) , (8) where th...