Improving the Accuracy of Amortized Model Comparison with Self-Consistency

Aayush Mishra; Daniel Habermann; Paul-Christian Bürkner; Stefan T. Radev; Šimon Kucharský

arXiv: 2508.20614 · v3 · submitted 2025-08-28 · stat.ML · cs.LG · stat.CO


Pith reviewed 2026-05-18 21:12 UTC · model grok-4.3

classification stat.ML · cs.LG · stat.CO

keywords amortized Bayesian model comparison · self-consistency loss · model misspecification · neural surrogates · simulation-based inference · open-world model comparison · distribution shift · model selection


The pith

Self-consistency training on unlabeled real data improves amortized Bayesian model comparison accuracy when all candidate models are misspecified.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Amortized Bayesian model comparison trains neural networks on simulated data to rank competing models quickly. The paper demonstrates that supplementing this training with a self-consistency loss on real unlabeled data sharpens the resulting estimates precisely in the open-world setting where every model fails to generate the observed data. Classifier-based approaches already work reasonably when one model is correct but gain the least from the added loss. Strong gains appear instead when analytic likelihoods exist or when surrogate likelihoods track the true posterior locally, even if the overall model is severely wrong. This matters because model comparison delivers the most value exactly when perfect specification is impossible.
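As a minimal sketch of what "ranking" means computationally: once each candidate model has a log marginal likelihood estimate, posterior model probabilities follow from Bayes' rule over models. The function name `posterior_model_probs`, the uniform model prior, and the three log evidence values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def posterior_model_probs(log_evidences, log_priors=None):
    """Turn log marginal likelihoods log p(y | M_k) into p(M_k | y)."""
    log_ev = np.asarray(log_evidences, dtype=float)
    if log_priors is None:
        # Assumption: uniform prior over the candidate models.
        log_priors = np.full_like(log_ev, -np.log(log_ev.size))
    log_joint = log_ev + np.asarray(log_priors, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    log_joint = log_joint - log_joint.max()
    weights = np.exp(log_joint)
    return weights / weights.sum()

# Hypothetical log evidences for three candidate models.
log_evidences = [-104.2, -101.7, -110.9]
probs = posterior_model_probs(log_evidences)

# Log Bayes factor of model 2 over model 1 is the difference of log evidences.
log_bf_21 = log_evidences[1] - log_evidences[0]
```

Because only differences between log evidences enter the result, an error in any single log p(y | M_k) estimate shifts the whole ranking, which is why the accuracy of these estimates under misspecification matters.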

Core claim

In the open-world scenario where all models are misspecified, self-consistency training on unlabeled real data strongly improves the accuracy of amortized BMC estimators that rely on analytic likelihoods or on surrogate likelihoods that remain locally accurate near the true parameter posterior, even for severely misspecified models.

What carries the argument

Self-consistency loss applied to unlabeled real data, which regularizes the neural surrogates by enforcing that the marginal likelihood estimate implied by Bayes' theorem stays the same across different parameter values for the same observations.
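A self-consistency loss of this kind is usually built on the identity log p(y) = log p(θ) + log p(y | θ) − log p(θ | y), which holds for every θ, so the variance of the right-hand side across parameter draws is zero when the posterior is exact. The sketch below illustrates this on a toy conjugate Gaussian model (prior θ ~ N(0, 1), observations y_i ~ N(θ, 1)). The toy model, the function names, and the choice of 64 draws are assumptions for illustration and not the paper's implementation.

```python
import numpy as np

def log_normal(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def log_evidence_estimates(y, thetas, q_mean, q_sd):
    """log p(theta) + log p(y | theta) - log q(theta | y), one value per theta."""
    log_prior = log_normal(thetas, 0.0, 1.0)
    log_lik = log_normal(y[None, :], thetas[:, None], 1.0).sum(axis=1)
    log_q = log_normal(thetas, q_mean, q_sd)
    return log_prior + log_lik - log_q

def sc_loss(y, thetas, q_mean, q_sd):
    """Variance of the implied log evidence across parameter draws."""
    return np.var(log_evidence_estimates(y, thetas, q_mean, q_sd))

rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=20)  # stands in for unlabeled observed data
n = y.size

# Exact conjugate posterior for this toy model.
post_mean, post_sd = y.sum() / (n + 1), np.sqrt(1.0 / (n + 1))
thetas = rng.normal(post_mean, post_sd, size=64)

loss_exact = sc_loss(y, thetas, post_mean, post_sd)         # numerically zero
loss_biased = sc_loss(y, thetas, post_mean + 0.5, post_sd)  # clearly positive
```

Note that the loss never uses a ground-truth parameter, only the observations, which is what allows it to be evaluated on real data with no labels.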

If this is right

Classifier-based BMC methods perform adequately without self-consistency in closed-world settings but improve the least from the added training.
Self-consistency yields substantial gains in open-world settings whenever analytic likelihoods or locally accurate surrogates are present.
Practical use of amortized BMC should incorporate self-consistency whenever misspecification is suspected.
Future extensions could explore self-consistency for other forms of distribution shift in simulation-based inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar consistency losses on real data could improve other amortized inference pipelines that face distribution shifts.
The approach suggests hybrid simulation-plus-real training can lessen dependence on perfectly specified generative models.
Testing the method on hierarchical or high-dimensional models would clarify how far the local-accuracy condition can stretch.
The pattern connects to consistency regularization techniques already used to handle shifts in supervised learning.

Load-bearing premise

Surrogate likelihoods must remain locally accurate near the true parameter posterior or analytic likelihoods must be available for at least some of the models being compared.

What would settle it

An open-world experiment in which the surrogate likelihoods are accurate near the true posterior (even if they deviate substantially elsewhere) yet self-consistency training yields no accuracy improvement would falsify the central claim.

Figures

Figures reproduced from arXiv: 2508.20614 by Aayush Mishra, Daniel Habermann, Paul-Christian Bürkner, Stefan T. Radev, Šimon Kucharský.

**Figure 2.** Comparison of the estimated log p(y) from NPE+SC and NPE against the gold-standard bridge-sampling results. The dataset from M = 15 countries was used to evaluate the SC loss.

5.2 Experiment 2: Racing diffusion models of decision making. Next, we analyze real data from the lexical decision task of Wagenmakers et al. [24] (Experiment 1), where 17 participants judged letter strings as words or non-words under…

**Figure 3.** Comparison of the estimated log p(y) from NPE+SC and NPE against the gold-standard bridge-sampling results. The dataset from M = 4 countries was used to evaluate the SC loss and log p(y) was estimated using 256 Monte Carlo samples.

**Figure 4.** Comparison of the estimated log p(y) from NPE+SC and NPE against the gold-standard bridge-sampling results. The dataset from M = 8 countries was used to evaluate the SC loss and log p(y) was estimated using 256 Monte Carlo samples.
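The captions mention 256 Monte Carlo samples but do not spell out the estimator. One common choice, assumed here purely for illustration, is importance sampling with the approximate posterior as the proposal. The sketch uses a toy conjugate Gaussian model (prior θ ~ N(0, 1), observations y_i ~ N(θ, 1)) so that the estimate can be compared with the analytic log p(y); the function names and the deliberately widened proposal are hypothetical.

```python
import numpy as np

def log_normal(x, mean, sd):
    """Log density of N(mean, sd^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def log_evidence_is(y, q_mean, q_sd, num_samples, rng):
    """Importance-sampling estimate of log p(y) with q(theta | y) as proposal."""
    thetas = rng.normal(q_mean, q_sd, size=num_samples)
    log_w = (
        log_normal(thetas, 0.0, 1.0)                                # log prior
        + log_normal(y[None, :], thetas[:, None], 1.0).sum(axis=1)  # log likelihood
        - log_normal(thetas, q_mean, q_sd)                          # log proposal
    )
    shift = log_w.max()  # stabilized log-mean-exp of the weights
    return shift + np.log(np.mean(np.exp(log_w - shift)))

rng = np.random.default_rng(1)
y = rng.normal(0.8, 1.0, size=10)
n = y.size
post_mean, post_sd = y.sum() / (n + 1), np.sqrt(1.0 / (n + 1))

# Deliberately imperfect approximate posterior: 30% too wide.
estimate = log_evidence_is(y, post_mean, 1.3 * post_sd, num_samples=256, rng=rng)

# Analytic log marginal likelihood of the toy model, playing the role of the benchmark.
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
         - 0.5 * (np.sum(y ** 2) - y.sum() ** 2 / (n + 1)))
```

In the paper's experiments the benchmark is bridge sampling rather than a closed-form expression; the analytic value here only plays that role for the toy model.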

Original abstract

Amortized Bayesian model comparison (BMC) enables fast probabilistic ranking of models via simulation-based training of neural surrogates. However, the accuracy of neural surrogates deteriorates when simulation models are misspecified; the very case where model comparison is most needed. We evaluate four different amortized BMC methods. We supplement traditional simulation-based training of these methods with a self-consistency (SC) loss on unlabeled real data to improve BMC estimates under distribution shifts. Using one artificial and two real-world case studies, we compare amortized BMC estimators with and without SC against analytic or bridge sampling benchmarks. In the closed-world case (data is generated by one of the candidate models), BMC estimators using classifiers work acceptably well even without SC training. However, these methods also benefit the least from SC training. In the open-world scenario (all models misspecified), SC training strongly improves BMC estimators when having access to analytic likelihoods, or when surrogate likelihoods are locally accurate near the true parameter posterior, even for severely misspecified models. We conclude with practical recommendations for amortized BMC and suggestions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Self-consistency adds a useful robustness tweak for amortized BMC under misspecification, but gains in real data rest on an unverified local-accuracy assumption for the surrogates.


The main thing to know is that this paper tests adding a self-consistency loss on unlabeled real data to four amortized Bayesian model comparison methods and finds clear gains in open-world settings where every candidate model is misspecified. The improvements show up against analytic and bridge-sampling benchmarks on both artificial and real examples, and the authors separate closed-world from open-world behavior in a straightforward way. Classifiers hold up decently without the extra loss but pick up less from it, while other estimators benefit more when the condition on local surrogate accuracy holds. That distinction and the practical recommendations at the end are the parts that feel most usable right now.

The experiments are anchored to external benchmarks rather than circular internal quantities, which keeps the claims grounded. The soft spot is exactly the one the stress-test flags: the strongest reported gains in the severely misspecified real-data cases assume the surrogate likelihoods stay locally accurate near the true posterior, yet the write-up does not supply a separate check such as local KL or restricted posterior-predictive diagnostics to confirm that premise actually holds when analytic likelihoods are unavailable. If that assumption slips, the observed lift could shrink or disappear.

The work is aimed at people already doing simulation-based inference and amortized model comparison who want a simple training adjustment for robustness. It is not reorganizing the field, but the empirical comparisons are concrete enough and the open-world framing is honest enough that it deserves a full referee rather than a desk reject. I would send it out for review.

Referee Report

1 major / 2 minor

Summary. The paper introduces a self-consistency (SC) loss on unlabeled real data to supplement simulation-based training of four amortized Bayesian model comparison (BMC) methods. It evaluates these with and without SC on one artificial and two real-world case studies, benchmarking against analytic likelihoods or bridge sampling. Key findings: classifiers perform adequately without SC in closed-world settings (data generated by a candidate model) but gain least from SC; in open-world settings (all models misspecified), SC yields strong improvements when analytic likelihoods are available or when surrogate likelihoods remain locally accurate near the true posterior, even under severe misspecification. The work concludes with practical recommendations.

Significance. If the central findings hold after addressing verification gaps, the work would offer a practical enhancement to amortized BMC for the common case of misspecified simulation models, where model comparison is most needed. The explicit conditioning of gains on local surrogate accuracy and the closed- vs. open-world distinction provide useful guidance, and the use of external benchmarks (analytic/bridge sampling) strengthens the evaluation relative to purely internal metrics.

major comments (1)

[real-world case studies / open-world scenario] Real-world case studies (open-world experiments without analytic likelihoods): The claim that SC training strongly improves BMC estimators 'even for severely misspecified models' when surrogate likelihoods are locally accurate near the true parameter posterior is load-bearing for the open-world conclusion, yet no independent diagnostic is reported to confirm this local accuracy condition holds (e.g., no local KL divergence, restricted posterior-predictive calibration, or likelihood-ratio tests in a neighborhood of the inferred posterior). Without such a check, observed gains cannot be confidently attributed to the stated mechanism rather than other factors.

minor comments (2)

[abstract and experiments] The abstract and experiments section would benefit from explicit reporting of error bars, number of runs, and any exclusion criteria for the benchmark comparisons to allow readers to assess variability and robustness.
[methods] Clarify the precise definitions and architectural differences among the four amortized BMC methods early in the methods section, including how the SC loss is integrated with each.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed report. The distinction between closed- and open-world settings and the emphasis on local surrogate accuracy are central to our claims; we address the verification concern below and will revise the manuscript accordingly.


Referee: [real-world case studies / open-world scenario] Real-world case studies (open-world experiments without analytic likelihoods): The claim that SC training strongly improves BMC estimators 'even for severely misspecified models' when surrogate likelihoods are locally accurate near the true parameter posterior is load-bearing for the open-world conclusion, yet no independent diagnostic is reported to confirm this local accuracy condition holds (e.g., no local KL divergence, restricted posterior-predictive calibration, or likelihood-ratio tests in a neighborhood of the inferred posterior). Without such a check, observed gains cannot be confidently attributed to the stated mechanism rather than other factors.

Authors: We agree that an independent diagnostic would strengthen attribution of the observed SC gains to the local-accuracy mechanism rather than to other factors. In the real-world case studies the surrogate likelihoods were chosen on the basis of domain knowledge and were already subjected to global posterior-predictive checks; however, these checks were not restricted to neighborhoods of the inferred posterior. We will revise the manuscript to (i) report restricted posterior-predictive calibration and local likelihood-ratio diagnostics where computationally feasible, (ii) add an explicit limitations paragraph stating that the local-accuracy assumption remains partly inferential in the absence of analytic likelihoods, and (iii) temper the wording of the open-world conclusion to reflect this conditional support. These additions do not require new simulation experiments but will make the evidential basis for the claim more transparent.

revision: partial

Circularity Check

0 steps flagged

No circularity: improvements validated against external analytic and bridge-sampling benchmarks.


The paper defines a self-consistency loss added to simulation-based training of neural surrogates for amortized BMC and reports empirical gains in open-world misspecification settings. All reported improvements are measured against independent external references (analytic likelihoods or bridge sampling) rather than quantities defined by the fitted parameters or loss inside the paper. No derivation step equates a prediction to its own input by construction, renames a fitted quantity as a forecast, or relies on a load-bearing self-citation whose validity is presupposed by the present work. The central claims therefore remain self-contained and falsifiable outside the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or ad-hoc axioms are stated in the provided text.

axioms (1)

domain assumption: Neural surrogates trained on simulations can approximate model posterior probabilities.
Implicit foundation of all amortized BMC methods referenced in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We supplement simulation-based training with a self-consistency (SC) loss on unlabeled real data to improve BMC estimates under empirical distribution shifts.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

$\log p(y \mid M_k) \approx \log p(\theta_k^* \mid M_k) + \log p(y \mid \theta_k^*, M_k) - \log q_\phi(\theta_k^* \mid y, M_k)$
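The passage above can be read as a rearrangement of Bayes' theorem, sketched below. The first two lines are exact; the third replaces the true posterior with the neural approximation q_ϕ evaluated at a parameter value θ*_k. The last line is an assumption about the general form of a self-consistency loss (a variance over S parameter draws θ_k^{(s)}), written in the spirit of the self-consistency work the paper builds on rather than copied from it.

```latex
\begin{align}
p(\theta_k \mid y, M_k)
  &= \frac{p(\theta_k \mid M_k)\, p(y \mid \theta_k, M_k)}{p(y \mid M_k)} \\
\log p(y \mid M_k)
  &= \log p(\theta_k \mid M_k) + \log p(y \mid \theta_k, M_k)
     - \log p(\theta_k \mid y, M_k)
     \quad \text{for every } \theta_k \\
\log p(y \mid M_k)
  &\approx \log p(\theta_k^* \mid M_k) + \log p(y \mid \theta_k^*, M_k)
     - \log q_\phi(\theta_k^* \mid y, M_k) \\
\mathcal{L}_{\mathrm{SC}}(y)
  &= \operatorname{Var}_{s = 1, \dots, S}\!\left[
       \log p(\theta_k^{(s)} \mid M_k) + \log p(y \mid \theta_k^{(s)}, M_k)
       - \log q_\phi(\theta_k^{(s)} \mid y, M_k)
     \right]
\end{align}
```

Since the left-hand side of the second line does not depend on θ_k, any variation of the bracketed term across draws signals an error in q_ϕ (or in a surrogate likelihood), and it can be measured without knowing the true parameters.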

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1] David JC MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.

[2] Sanae Lotfi, Pavel Izmailov, Gregory Benton, Micah Goldblum, and Andrew Gordon Wilson. Bayesian model selection, the marginal likelihood, and generalization. In International Conference on Machine Learning, pages 14223–14247. PMLR, 2022.

[3] Stefan T Radev, Marco D’Alessandro, Ulf K Mertens, Andreas Voss, Ullrich Koethe, and Paul-Christian Buerkner. Amortized Bayesian model comparison with evidential deep learning. IEEE Transactions on Neural Networks and Learning Systems, 34(8):4903–4917, 2021.

[4] Stefan T Radev, Marvin Schmitt, Valentin Pratz, Umberto Picchini, Ullrich Köthe, and Paul-Christian Bürkner. JANA: Jointly amortized neural approximation of complex Bayesian models. In Uncertainty in Artificial Intelligence, pages 1695–1706. PMLR, 2023.

[5] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12):5635–5662, 2024.

[6] Marvin Schmitt, Paul-Christian Bürkner, Ullrich Köthe, and Stefan T Radev. Detecting model misspecification in amortized Bayesian inference with neural networks. In DAGM German Conference on Pattern Recognition, pages 541–557. Springer, 2023.

[7] David T Frazier, Ryan Kelly, Christopher Drovandi, and David J Warne. The statistical accuracy of neural posterior and likelihood estimation. arXiv preprint arXiv:2411.12068, 2024.

[8] Marvin Schmitt, Desi R. Ivanova, Daniel Habermann, Ullrich Köthe, Paul-Christian Bürkner, and Stefan T. Radev. Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference, July 2024. URL http://arxiv.org/abs/2310.04395. arXiv:2310.04395 [cs].

[9] Desi R. Ivanova, Marvin Schmitt, and Stefan T. Radev. Data-Efficient Variational Mutual Information Estimation via Bayesian Self-Consistency. In NeurIPS BDU Workshop 2024, October. URL https://openreview.net/forum?id=QfiyElaO1f&noteId=aRvehpmMkK

[10] Aayush Mishra, Daniel Habermann, Marvin Schmitt, Stefan T. Radev, and Paul-Christian Bürkner. Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data, May 2025. URL http://arxiv.org/abs/2501.13483. arXiv:2501.13483 [stat].

[11] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.

[12] Quentin F Gronau, Henrik Singmann, and Eric-Jan Wagenmakers. bridgesampling: An R package for estimating normalizing constants. Journal of Statistical Software, 92:1–29, 2020.

[13] Fernando Llorente, Luca Martino, David Delgado, and Javier Lopez-Santiago. Marginal likelihood computation for model selection and hypothesis testing: an extensive review. SIAM Review, 65(1):3–58, 2023.

[14] Kevin H Knuth, Michael Habeck, Nabin K Malakar, Asim M Mubeen, and Ben Placek. Bayesian evidence and model selection. Digital Signal Processing, 47:50–67, 2015.

[15] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

[16] Konstantin Karchev, Roberto Trotta, and Christoph Weniger. SimSIMS: Simulation-based Supernova Ia Model Selection with thousands of latent variables. arXiv preprint arXiv:2311.15650, 2023.

[17] Konstantina Sokratous, Anderson K Fitch, and Peter D Kvam. How to ask twenty questions and win: Machine learning tools for assessing preferences from small samples of willingness-to-pay prices. Journal of Choice Modelling, 48:100418, 2023.

[18] Lukas Schumacher, Martin Schnuerch, Andreas Voss, and Stefan T Radev. Validation and comparison of non-stationary cognitive models: A diffusion model application. Computational Brain & Behavior, 8(2):191–210, 2025.

[19] Lasse Elsemüller, Martin Schnuerch, Paul-Christian Bürkner, and Stefan T Radev. A deep learning method for comparing Bayesian hierarchical models. Psychological Methods, 2024.

[20] Cornelius Schröder and Jakob H Macke. Simultaneous identification of models and parameters of scientific simulators. In Proceedings of the 41st International Conference on Machine Learning, pages 43895–43927, 2024.

[21] Niall Jeffrey and Benjamin D Wandelt. Evidence Networks: Simple losses for fast, amortized, neural Bayesian model comparison. Machine Learning: Science and Technology, 5(1):015008, 2024.

[22] A Spurio Mancini, MM Docherty, MA Price, and JD McEwen. Bayesian model comparison for simulation-based inference. RAS Techniques and Instruments, 2(1):710–722, 2023.

[23] Rahul Srinivasan, Marco Crisostomi, Roberto Trotta, Enrico Barausse, and Matteo Breschi. Bayesian evidence estimation from posterior samples with normalizing flows. Physical Review D, 110(12):123007, 2024.

[24] Eric-Jan Wagenmakers, Roger Ratcliff, Pablo Gomez, and Gail McKoon. A diffusion model account of criterion shifts in the lexical decision task. Journal of Memory and Language, 58(1):140–159, 2008.

[25] Gabriel Tillman, Trish Van Zandt, and Gordon D Logan. Sequential sampling models without random between-trial variability: The racing diffusion model of speeded decision making. Psychonomic Bulletin & Review, 27(5):911–936, 2020.

[26] Stan Development Team. Stan Reference Manual, 2025. URL https://mc-stan.org/. Version 2.32.2.

[27] Stan Development Team. RStan: the R interface to Stan, 2025. URL https://mc-stan.org/. R package version 2.32.7.

[28] Eurostat. International extra-EU air passenger transport by reporting country and partner world regions and countries, doi:10.2908/avia_paexcc, 2022.

[29] Eurostat. Household debt, consolidated including Non-profit institutions serving households - % of GDP, doi:10.2908/TIPSD22, 2022.

[30] Eurostat. Real GDP per capita, doi:10.2908/SDG_08_10, 2022.

[31] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

[32] Diganta Misra. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681, 2019.

[33] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.

[34] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. Advances in Neural Information Processing Systems, 32, 2019.

[35] Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. ACM Transactions on Graphics (ToG), 38(5):1–19, 2019.

Appendix A The marginal likelihood and self-consistency

We start with Bayes' theorem for the parameter posterior given data, $p(\theta_k \mid y, M_k) = \frac{p(\theta_k \mid M_k)\, p(y \mid \theta_k, M_k)}{p(y \mid M_k)}$, (8) where th...
