
arxiv: 2605.15407 · v2 · pith:4HXAPRVFnew · submitted 2026-05-14 · 🧮 math.NA · cs.AI · cs.NA

Amortized Energy-Based Bayesian Inference

Pith reviewed 2026-05-20 19:53 UTC · model grok-4.3

classification 🧮 math.NA · cs.AI · cs.NA
keywords amortized inference · transport maps · Bayesian inverse problems · energy distance · neural operators · posterior approximation · likelihood free

The pith

A transport map learned from joint samples approximates posteriors for repeated Bayesian inference in nonlinear inverse problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an amortized approach to Bayesian inference for nonlinear inverse problems, where the same inference task must be solved for many different observations. Rather than using MCMC to solve a new problem each time, it learns a map that takes an observation and pushes a reference distribution to approximate the corresponding posterior. Training minimizes an averaged energy-distance objective using only samples from the joint distribution of parameters and observations, making the method likelihood-free. In function-space settings with Gaussian priors, the map is parameterized as the identity plus a perturbation in the Cameron-Martin space to maintain absolute continuity, with neural operators used for the infinite-dimensional representation. Demonstrations on a finite-dimensional example and two PDE inverse problems show that the map recovers multimodal structure and supports fast sampling for unseen observations.

Core claim

The central claim is that an observation-dependent transport map, trained by minimizing the averaged energy distance between the true posterior and the learned pushforward, can be learned from joint samples alone and then used to generate approximate posterior samples for new observations in both finite- and infinite-dimensional nonlinear inverse problems.

What carries the argument

The learned observation-dependent transport map, which pushes a reference measure forward to approximate the posterior and is trained via the averaged energy-distance objective.
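For reference, the objective named above can be written out as follows. This is a sketch in assumed notation: the symbols $\rho$ (reference measure), $T(\cdot, y)$ (observation-dependent map), and $\pi(\cdot \mid y)$ (posterior) are chosen here for illustration and may differ from the paper's. The first line is the standard definition of the squared energy distance between two probability measures with finite first moments.

```latex
\[
D_E^2(\mu,\nu) \;=\; 2\,\mathbb{E}\|X - Y\| \;-\; \mathbb{E}\|X - X'\| \;-\; \mathbb{E}\|Y - Y'\|,
\qquad X, X' \sim \mu,\quad Y, Y' \sim \nu \ \text{(all independent)},
\]
\[
J(T) \;=\; \mathbb{E}_{y}\Big[\, D_E^2\big(\pi(\cdot \mid y),\; T(\cdot, y)_{\#}\rho\big) \Big].
\]
```

Here $T(\cdot, y)_{\#}\rho$ denotes the pushforward of $\rho$ under the map for a fixed observation $y$, and the outer expectation is over the marginal distribution of observations.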

If this is right

  • The learned map enables rapid posterior sampling for new observations without re-solving a full inference problem each time.
  • The approach works in likelihood-free settings requiring only joint samples from parameters and observations.
  • Parameterization via Cameron-Martin perturbations ensures the map preserves absolute continuity with respect to Gaussian priors in function space.
  • Neural operator representations allow the method to handle infinite-dimensional PDE-constrained inverse problems.
  • Posterior structure including multimodality and dominant modes is captured in the learned approximations.
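The Cameron-Martin parameterization mentioned in the list above can be illustrated in a truncated eigenbasis of the prior covariance. The sketch below is an assumption-laden toy, not the paper's implementation: the spectrum `eigvals`, the function names, and the fixed vector `g_coef` (standing in for a neural-operator output that would depend on the reference draw and the observation) are all hypothetical. It only shows the coefficient scaling that places the perturbation in the Cameron-Martin space; it does not verify absolute continuity.

```python
import numpy as np

def cameron_martin_map(z_coef, g_coef, eigvals):
    """Identity plus a Cameron-Martin perturbation, in prior eigenbasis coefficients.

    z_coef : coefficients of a reference draw from N(0, C)
    g_coef : coefficients of a square-summable element g (stand-in for a network output)
    eigvals: eigenvalues lambda_k of the prior covariance C (assumed known)
    """
    h_coef = np.sqrt(eigvals) * g_coef      # h = C^{1/2} g lies in the Cameron-Martin space
    return z_coef + h_coef, h_coef

def cameron_martin_norm(h_coef, eigvals):
    # ||h||_CM^2 = sum_k h_k^2 / lambda_k
    return np.sqrt(np.sum(h_coef**2 / eigvals))

rng = np.random.default_rng(0)
K = 64                                               # truncation level (hypothetical)
eigvals = 1.0 / np.arange(1, K + 1) ** 2             # hypothetical spectrum lambda_k = k^-2
z_coef = np.sqrt(eigvals) * rng.standard_normal(K)   # truncated draw from N(0, C)
g_coef = np.tanh(rng.standard_normal(K)) / np.arange(1, K + 1)
theta_coef, h_coef = cameron_martin_map(z_coef, g_coef, eigvals)
```

With this scaling the Cameron-Martin norm of the perturbation equals the ordinary Euclidean norm of `g_coef`, so keeping the network output bounded in the usual norm keeps the shift inside the Cameron-Martin space as the truncation level grows.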

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the energy-distance minimization succeeds, the method could be applied to sequential data assimilation where observations arrive over time.
  • Similar amortization ideas might extend to other sampling-based inference tasks beyond inverse problems.
  • Replacing energy distance with alternative metrics could be explored for improved performance in specific applications.
  • Validation on additional inverse problems would test the generality of the transport map parameterization.

Load-bearing premise

That the transport map obtained by minimizing the averaged energy-distance objective provides a sufficiently close approximation to the true posterior for practical use in the target applications.

What would settle it

Running independent MCMC on a new observation and comparing the resulting samples or statistics to those generated by the trained transport map; large discrepancies would indicate the learned approximation is inaccurate.
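One way to turn this comparison into a number is to estimate the energy distance between the two sample sets for the same observation. The following is a minimal sketch under stated assumptions: `energy_distance_sq` is a hypothetical helper, and the arrays are synthetic stand-ins for MCMC output and map output, not results from the paper.

```python
import numpy as np

def energy_distance_sq(x, y):
    """V-statistic estimate of the squared energy distance between two sample sets.

    x : (n, d) array, e.g. MCMC samples for one observation
    y : (m, d) array, e.g. samples from the trained map for the same observation
    """
    def mean_pairwise(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()
    return 2.0 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

rng = np.random.default_rng(0)
mcmc_like = rng.standard_normal((500, 2))        # stand-in for reference samples
map_good = rng.standard_normal((500, 2))         # same distribution, independent draws
map_bad = rng.standard_normal((500, 2)) + 1.0    # shifted distribution
d_good = energy_distance_sq(mcmc_like, map_good)
d_bad = energy_distance_sq(mcmc_like, map_bad)
```

A value near zero is consistent with the two sample sets sharing a distribution, while a clearly larger value, as for the shifted set here, signals a mismatch. The pairwise computation costs O(nm) memory, so very large sample sets would need batching.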

Original abstract

We consider amortized Bayesian inference for nonlinear inverse problems in settings where only samples from the joint distribution of parameters and observations are available. Classical methods such as Markov chain Monte Carlo require solving a new inference problem for each observation, which can be computationally prohibitive when inference must be repeated many times. We propose a transport-based approach that learns an observation-dependent map pushing forward a reference measure to approximate the posterior distribution. The map is trained by minimizing an averaged energy-distance objective between the true posterior and the learned pushforward. This formulation is likelihood-free, requiring only joint samples, and avoids density evaluation, invertibility constraints, and Jacobian determinant computations. For function-space inverse problems with Gaussian priors, we parameterize the transport map as the identity plus a perturbation in the Cameron-Martin space of the prior, preserving absolute continuity with respect to the prior. In infinite-dimensional settings, the map is represented using neural operators. We illustrate the method on a finite-dimensional nonlinear inverse problem and two PDE-constrained inverse problems arising in porous medium flow and seismic inversion. The results show that the learned transport captures posterior structure, including multimodality and dominant modes, while enabling fast posterior sampling for new observations.
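The abstract's claim that joint samples suffice can be made concrete with a standard identity, sketched here under our own assumptions rather than as the paper's stated estimator. Expanding the squared energy distance gives three terms; the one involving two independent posterior draws for the same observation does not depend on the map, so it drops out of the minimization. What remains needs a single θ per y plus repeated draws through the map. The function name `energy_score_loss`, the `transport(z, y)` interface, the Gaussian reference, and the toy model y = θ + noise are all hypothetical.

```python
import numpy as np

def energy_score_loss(theta, y, transport, rng, m=16):
    """Monte Carlo estimate of the map-dependent part of the averaged energy distance.

    theta     : (n, d) parameters from joint samples
    y         : (n, p) matching observations
    transport : callable (z, y) -> (n, m, d); hypothetical map interface
    m         : reference draws per observation (m >= 2)
    """
    n, d = theta.shape
    z = rng.standard_normal((n, m, d))               # reference samples (Gaussian assumed)
    s = transport(z, y)                              # pushforward samples per observation
    cross = np.linalg.norm(s - theta[:, None, :], axis=-1).mean(axis=1)
    pair = np.linalg.norm(s[:, :, None, :] - s[:, None, :, :], axis=-1)
    self_term = pair.sum(axis=(1, 2)) / (m * (m - 1))  # diagonal is zero: average over j != k
    return float(np.mean(2.0 * cross - self_term))

def affine_map(a, b):
    # Toy map T(z, y) = a*y + b*z for scalar theta and y
    return lambda z, y: a * y[:, None, :] + b * z

rng = np.random.default_rng(0)
n = 4000
theta = rng.standard_normal((n, 1))
y = theta + rng.standard_normal((n, 1))              # toy model: y = theta + noise
loss_exact = energy_score_loss(theta, y, affine_map(0.5, np.sqrt(0.5)), rng)
loss_prior = energy_score_loss(theta, y, affine_map(0.0, 1.0), rng)
```

For this toy model the exact posterior is N(y/2, 1/2), so `affine_map(0.5, np.sqrt(0.5))` is an exact transport. In expectation its loss is about 0.80, against about 1.13 for the map that ignores y and returns the prior.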

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an amortized Bayesian inference framework for nonlinear inverse problems that learns an observation-dependent transport map pushing a reference measure forward to approximate the posterior. The map is trained by minimizing an averaged energy-distance objective between the true posterior and the learned pushforward, using only joint samples from p(θ, y). The approach is likelihood-free, avoids density evaluations and Jacobian computations, and for function-space problems with Gaussian priors parameterizes the map as the identity plus a Cameron-Martin perturbation, represented via neural operators in infinite dimensions. The method is illustrated on one finite-dimensional nonlinear inverse problem and two PDE-constrained problems (porous medium flow and seismic inversion), with claims that the learned map captures multimodality and dominant modes while enabling fast sampling for new observations.

Significance. If the training procedure can be made executable and the resulting approximations are shown to be accurate, the work would offer a practical advance in amortized inference for settings where repeated posterior sampling is needed and classical MCMC is too slow. The energy-distance formulation and the structure-preserving parameterization for infinite-dimensional problems are technically interesting, and the qualitative demonstrations on multimodality provide initial evidence of utility, though quantitative validation would strengthen the case.

major comments (2)
  1. [Training Objective (abstract and §3)] The central training procedure (minimization of the averaged energy-distance objective between p(θ|y) and the learned pushforward) is described as requiring only joint samples, yet no explicit estimator, resampling scheme, or reformulation is provided that would allow computation of the energy distance for fixed y. Joint samples from p(θ, y) typically yield at most one θ per distinct y in continuous settings, which is insufficient to estimate the posterior or the distance without additional machinery; this renders the stated objective non-executable as described and directly undermines the likelihood-free claim.
  2. [Numerical Experiments (§5)] The numerical results rely exclusively on qualitative visualizations of captured multimodality and dominant modes without any quantitative error metrics, convergence diagnostics, or comparisons to ground-truth posteriors (e.g., Wasserstein distances, effective sample sizes, or posterior coverage). This absence makes it impossible to assess whether the learned map provides a sufficiently accurate approximation for the target inverse-problem applications.
minor comments (2)
  1. [§2] Notation for the energy distance and the averaging over observations could be introduced more explicitly with an equation number to improve readability.
  2. [Abstract] The abstract states results on 'three example problems' while the body describes one finite-dimensional case plus two PDE-constrained cases; a brief clarifying sentence would avoid minor confusion.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. We address each major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Training Objective (abstract and §3)] The central training procedure (minimization of the averaged energy-distance objective between p(θ|y) and the learned pushforward) is described as requiring only joint samples, yet no explicit estimator, resampling scheme, or reformulation is provided that would allow computation of the energy distance for fixed y. Joint samples from p(θ, y) typically yield at most one θ per distinct y in continuous settings, which is insufficient to estimate the posterior or the distance without additional machinery; this renders the stated objective non-executable as described and directly undermines the likelihood-free claim.

    Authors: We agree that the manuscript requires a more explicit description of the Monte Carlo estimator used to approximate the averaged energy-distance objective. The current text states that the approach requires only joint samples but does not detail the finite-sample procedure, including how expectations are formed over batches of observations and how multiple pushforward samples are drawn for each fixed y. In the revised manuscript we will add a dedicated subsection in §3 that presents the empirical estimator, specifies the batching strategy over joint samples, and clarifies that the energy-distance terms involving the learned map are estimated by repeated sampling from the reference measure through the map while the cross terms are estimated from the available joint pairs. We will also note any practical requirements for generating sufficiently many map samples per observation to obtain stable estimates. revision: yes

  2. Referee: [Numerical Experiments (§5)] The numerical results rely exclusively on qualitative visualizations of captured multimodality and dominant modes without any quantitative error metrics, convergence diagnostics, or comparisons to ground-truth posteriors (e.g., Wasserstein distances, effective sample sizes, or posterior coverage). This absence makes it impossible to assess whether the learned map provides a sufficiently accurate approximation for the target inverse-problem applications.

    Authors: We acknowledge that the present numerical section emphasizes qualitative illustrations of multimodality and mode capture. While these visualizations are useful for demonstrating the method’s qualitative behavior on the chosen examples, we agree that quantitative metrics would strengthen the evaluation. In the revised manuscript we will augment §5 with quantitative assessments: Wasserstein distances to reference posteriors on the finite-dimensional nonlinear problem (where ground truth can be obtained by long-run MCMC), posterior coverage probabilities, and effective sample size comparisons against independent MCMC runs for the PDE-constrained examples. We will also report training and inference wall-clock times to quantify the amortization benefit. revision: yes
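Of the quantitative checks named in the second response, posterior coverage is simple to sketch. The example below is illustrative only: `empirical_coverage` is a hypothetical helper, the scalar toy model y = θ + noise is an assumption, and the two sample arrays stand in for draws from a trained map. It counts how often a held-out parameter falls inside the central credible interval of its approximate posterior.

```python
import numpy as np

def empirical_coverage(theta_true, post_samples, level=0.9):
    """Fraction of true parameters inside the central credible interval.

    theta_true   : (n,) held-out scalar parameters
    post_samples : (n, m) approximate posterior draws, one row per observation
    """
    lo = np.quantile(post_samples, (1.0 - level) / 2.0, axis=1)
    hi = np.quantile(post_samples, (1.0 + level) / 2.0, axis=1)
    return float(np.mean((theta_true >= lo) & (theta_true <= hi)))

rng = np.random.default_rng(0)
n, m = 2000, 400
theta = rng.standard_normal(n)
y = theta + rng.standard_normal(n)                    # toy model: y = theta + noise
z = rng.standard_normal((n, m))
calibrated = 0.5 * y[:, None] + np.sqrt(0.5) * z      # exact posterior N(y/2, 1/2)
overconfident = 0.5 * y[:, None] + 0.3 * z            # posterior spread too narrow
cov_ok = empirical_coverage(theta, calibrated)
cov_bad = empirical_coverage(theta, overconfident)
```

A calibrated approximation should give coverage close to the nominal level, while an overconfident one falls well below it. This check is marginal over observations, so it complements rather than replaces per-observation comparisons against MCMC.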

Circularity Check

0 steps flagged

No circularity: training objective and transport map defined directly from joint samples without reduction to inputs

full rationale

The paper proposes learning an observation-dependent transport map by minimizing an averaged energy-distance objective between the true posterior and the learned pushforward, explicitly using only joint samples from p(θ, y). This objective is stated as the training criterion without any reduction to a fitted parameter renamed as a prediction, self-definitional loop, or load-bearing self-citation for uniqueness. The infinite-dimensional parameterization (identity plus Cameron-Martin perturbation, neural operators) is given explicitly as an implementation choice. Claims about capturing multimodality are presented as empirical outcomes of the method rather than tautological derivations. The derivation chain is self-contained, with the method's executability resting on external estimation of the energy distance from joint samples rather than any internal circular equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard optimal-transport and Bayesian-inference assumptions plus the modeling choice of representing the map as identity plus Cameron-Martin perturbation; no new entities are postulated.

axioms (2)
  • domain assumption: A transport map exists that pushes a reference measure to the target posterior.
    Invoked when the method is introduced as learning an observation-dependent map.
  • domain assumption: The energy-distance objective can be minimized to yield a useful posterior approximation.
    Central to the training formulation described in the abstract.

pith-pipeline@v0.9.0 · 5737 in / 1329 out tokens · 40059 ms · 2026-05-20T19:53:38.016309+00:00 · methodology


