
arxiv: 2505.14429 · v5 · submitted 2025-05-20 · 🧬 q-bio.QM

Compositional amortized inference for large-scale hierarchical Bayesian models

Pith reviewed 2026-05-22 14:19 UTC · model grok-4.3

classification 🧬 q-bio.QM
keywords amortized Bayesian inference · compositional score matching · hierarchical models · diffusion models · error damping · microscopy inverse problem · simulation-based inference

The pith

Error-damping estimator stabilizes compositional score matching for hierarchical models with over 750,000 parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends compositional score matching for amortized Bayesian inference to hierarchical models by adding an error-damping estimator. This change fixes instability that arises when many individual data-point updates are combined in diffusion-model approximations. The method stays stable on controlled tests with up to 100,000 points and runs competitive inference on hierarchical autoregressive models while using less than one full simulation for the largest cases. It is then applied to a real inverse problem in advanced microscopy that involves more than 750,000 parameters. The result shows that divide-and-conquer Bayesian updating can now handle the scale of typical scientific datasets without requiring exhaustive joint simulations.

Core claim

The central claim is that an error-damping correction applied inside compositional score matching removes the numerical instability that previously limited aggregation of many data points, while still recovering accurate posterior approximations. This enables amortized inference on hierarchical models whose joint simulation would otherwise be prohibitive. The paper verifies the fix first on synthetic benchmarks and then on a fluorescence-microscopy inverse problem whose parameter count exceeds 750,000.
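The aggregation step that the claim refers to can be illustrated at diffusion time zero, where compositional score matching combines single-observation posterior scores as (1 - J) times the prior score plus their sum. The sketch below checks that identity on a Gaussian toy model. The prior N(0, σ²I) matches the toy posterior quoted in the reference graph; the likelihood Y_j ~ N(η, σ²I), the dimensions, and all function names are assumptions made for illustration, and the paper's learned score networks are replaced by closed-form scores.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, dim, J = 1.5, 3, 50  # assumed toy settings

eta_true = rng.normal(0.0, sigma, size=dim)
Y = rng.normal(eta_true, sigma, size=(J, dim))  # J observations

def prior_score(eta):
    # Score of the prior eta ~ N(0, sigma^2 I).
    return -eta / sigma**2

def single_posterior_score(eta, y):
    # One observation gives p(eta | y) = N(y / 2, (sigma^2 / 2) I).
    return -(2.0 / sigma**2) * (eta - y / 2.0)

def joint_posterior_score(eta, Y):
    # All observations give N(sum(Y) / (J + 1), sigma^2 / (J + 1) I).
    n = len(Y)
    mu = Y.sum(axis=0) / (n + 1)
    return -((n + 1) / sigma**2) * (eta - mu)

def compositional_score(eta, Y):
    # (1 - J) * prior score + sum of single-observation posterior scores.
    n = len(Y)
    singles = sum(single_posterior_score(eta, y) for y in Y)
    return (1 - n) * prior_score(eta) + singles

eta = rng.normal(size=dim)  # arbitrary evaluation point
composed = compositional_score(eta, Y)
exact = joint_posterior_score(eta, Y)
print(np.allclose(composed, exact))
```

With exact scores the two expressions agree, so any instability in practice has to come from approximation error in the individual score estimates rather than from the composition rule itself.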

What carries the argument

The error-damping estimator inside compositional score matching, which rescales or corrects the aggregated score estimates to prevent error accumulation across large numbers of observations.
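Why aggregation needs a correction can be seen from a simple error budget: a sum of J score estimates inherits J copies of any shared bias, while independent noise grows only with the square root of J. The sketch below uses hypothetical error magnitudes (`per_score_bias`, `per_score_noise`) and does not reproduce the paper's estimator; it only shows the growth that a damping term would have to counteract.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 2
per_score_bias = 0.01   # hypothetical systematic error of each score estimate
per_score_noise = 0.01  # hypothetical random error of each score estimate

def aggregated_error(J):
    # Norm of the summed error when J imperfect scores are added up.
    noise = rng.normal(0.0, per_score_noise, size=(J, dim))
    per_score_errors = per_score_bias + noise
    return np.linalg.norm(per_score_errors.sum(axis=0))

sizes = [10, 1_000, 100_000]
errors = {J: aggregated_error(J) for J in sizes}
for J, err in errors.items():
    print(f"J={J:>7d}  aggregated error norm = {err:.3f}")
```

At J = 100,000 the shared bias dominates by orders of magnitude, which is consistent with the page's description of error accumulation across large numbers of observations.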

If this is right

  • Numerical stability holds for datasets containing up to 100,000 points on controlled benchmarks.
  • Posterior accuracy on hierarchical autoregressive models is competitive with direct ABI baselines at smaller problem sizes, while larger problem sizes require less than one full model simulation.
  • The same procedure inverts a real microscopy problem with more than 750,000 parameters.
  • Compositional amortized inference therefore becomes practical for hierarchical models whose direct simulation is computationally prohibitive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The damping technique may transfer to other simulation-based inference pipelines that rely on score or gradient aggregation.
  • Similar corrections could be tested on hierarchical models in fields such as systems biology or neuroimaging where data volumes are comparably large.
  • The method invites direct comparison against non-compositional baselines on the same 750,000-parameter microscopy task to quantify the exact simulation savings.

Load-bearing premise

The error-damping estimator continues to preserve statistical accuracy, not just numerical stability, when the underlying diffusion approximations are applied to real noisy scientific measurements.

What would settle it

Parameter recovery on the microscopy dataset deviates markedly from independent reference estimates or ground-truth values once the number of aggregated points exceeds a few tens of thousands.

Figures

Figures reproduced from arXiv: 2505.14429 by Catherine Sherry, Jan Hasenauer, Jonas Arruda, Margarida Barroso, Stefan T. Radev, Vikas Pandey, Xavier Intes.

Figure 1
Figure 1: Compositional inference for hierarchical Bayesian models. Overview of our training procedure (left) and inference stages (right) for amortized hierarchical Bayesian modeling. Amortized posterior sampling uses our error-damping compositional score estimator to achieve rapid inference on very high-dimensional hierarchical problems.
Figure 2
Figure 2: Evaluation of the error-damping estimator for the Gaussian toy example. Different evaluation metrics are shown for different dataset sizes and damping factors d1 or cosine shifts s. The mini-batch size was set to 10% of the dataset size, and for each step, 10 runs were performed. The median and median absolute deviation are reported, besides for those runs in which none converged.
Figure 3
Figure 3: Assessing inference for high-resolution grids (128×128). A Global parameter recovery across 100 datasets, showing the posterior median and median absolute deviation. B Posterior calibration plot for the global parameters using SBC (Säilynoja et al., 2022). We scale the number of observations up to 100,000 to test the effect of dataset size on the error accumulation of the individual scores.
Figure 4
Figure 4: Inference for fluorescence lifetime imaging. A Mean intensity across time for each pixel, representing the fluorescence data. B Time series data and fitted posterior median for representative pixels. C Spatial map of the fitted local posteriors (medians) per pixel. D Spatial map of R² for each pixel, comparing our results with a flat Bayesian model and a popular baseline (MLE).
Figure 5
Figure 5: Assessing the adaptive sampling scheme for compositional inference in the toy model. (a) Increasing numbers of sampling steps are needed for increasing number of subsets of groups. (b) The adaptive step size is adaptively increased towards the end of the sampling (low noise region). [Plot axes recovered from extraction: Number of Steps vs. Data Size (with Max Steps marker); KL Divergence vs. Data Size.]
Figure 6
Figure 6: Evaluation of the error-damping estimator for the toy model. Different evaluation metrics are shown for different mini-batch sizes or varying numbers of subsets of groups. For each experiment, 10 runs were performed. The median and median absolute deviation are reported, besides for those runs where none converged. [Plot axes recovered from extraction: Number of Steps vs. Data Size (with Max Steps marker).]
Figure 7
Figure 7: Evaluation of the linear noise schedules for the toy model. For each experiment, 10 runs were performed. The median and median absolute deviation is reported, besides for those runs where none converged.
Figure 8
Figure 8: Evaluation of the error-damping estimator for the hierarchical AR(1) model. For each experiment, 10 runs were performed. The median and median absolute deviation is reported, besides for those runs where none converged. A mini-batch size of 10% of the data was employed, and score models were trained on a single group.
Figure 9
Figure 9: Assessing inference of global parameters for the FLI model. Synthetic data on a 32×32 grid was generated. [Plot residue: three panels of Estimate vs. Ground truth annotated with r = 0.908, r = 0.987, and r = 0.973.] (a) Recovery of transformed local par…
Figure 10
Figure 10: Assessing inference of local parameters for the FLI model. Synthetic data on a 32×32 grid was generated. We compared our hierarchical approach against the standard non-hierarchical pixel-wise MLE.
Figure 11
Figure 11: Assessing inference of local parameters for the FLI model on real data. We compared our hierarchical approach with the standard non-hierarchical pixel-wise MLE. Owing to the low photon count, the average lifetime τ_mean is the most reliable quantity for this non-hierarchical method. Furthermore, we show additional random simulations from the hierarchical posterior (median and 95% confidence region out of …
Figure 12
Figure 12: Global posteriors for the real FLI data.
Original abstract

Amortized Bayesian inference (ABI) with neural networks has emerged as a powerful simulation-based approach for estimating complex mechanistic models. However, extending ABI to hierarchical models, a cornerstone of modern Bayesian analysis, has been a major hurdle due to the need to simulate and process massive datasets. Our study tackles these challenges by extending compositional score matching (CSM), a divide-and-conquer strategy for Bayesian updating using diffusion models. We develop a new error-damping estimator to address previous stability issues of CSM when aggregating large numbers of data points. We first verified the numerical stability with up to 100,000 data points on a controlled benchmark. We then evaluated our method on a hierarchical AR model, achieving competitive performance to direct ABI baselines on smaller problem sizes while using less than one full model simulation for larger problem sizes. Finally, we address a large-scale inverse problem in advanced microscopy with over 750,000 parameters, demonstrating its relevance to real scientific applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript extends compositional score matching (CSM) for amortized Bayesian inference to large-scale hierarchical models by introducing an error-damping estimator that mitigates stability issues during aggregation of many data points. It reports numerical stability verification on controlled benchmarks with up to 100,000 points, competitive performance to direct ABI baselines on a hierarchical AR model while using fewer than one full simulation for larger sizes, and a demonstration on an advanced microscopy inverse problem with over 750,000 parameters.

Significance. If the error-damping estimator is shown to preserve statistical accuracy in addition to numerical stability, the approach could enable practical amortized inference for complex hierarchical models in data-intensive scientific domains such as microscopy, offering substantial computational savings over direct methods for problems with hundreds of thousands of parameters.

major comments (2)
  1. [Abstract / large-scale inverse problem demonstration] Abstract and microscopy demonstration: the central claim that the error-damping estimator addresses stability while preserving correctness is load-bearing, yet the 750,000-parameter inverse problem is presented only as a relevance demonstration without reported metrics (posterior mean error, coverage, or comparison to a subsampled direct baseline) under the actual noise model of the microscopy data.
  2. [Numerical stability verification and hierarchical AR evaluation] Benchmark and AR model sections: the effect of the error-damping strength on the statistical properties of the aggregated posterior (bias, variance, or calibration) is not explicitly quantified, leaving open the possibility that stability gains come at the cost of under-dispersion or systematic bias when diffusion-model approximations encounter non-synthetic likelihoods.
minor comments (2)
  1. [Methods] Clarify how the free parameter for error-damping strength is selected or tuned across experiments, including any sensitivity analysis.
  2. [Figures] Ensure all figure captions explicitly state the number of data points, parameter count, and whether results are on synthetic or real data.
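The coverage metric requested in the first major comment, and the under-dispersion concern in the second, can be made concrete with a small simulation. The sketch below is an illustration on a one-dimensional Gaussian toy model with a closed-form posterior, not the paper's evaluation code; the likelihood Y_j ~ N(η, σ²), the replication count, and the `sd_scale` knob that mimics an overconfident approximation are all assumptions.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
sigma, J, n_rep = 1.0, 20, 4000
z = NormalDist().inv_cdf(0.95)  # half-width multiplier of a central 90% interval

def coverage(sd_scale):
    # Fraction of replications whose 90% interval contains the true eta.
    eta = rng.normal(0.0, sigma, size=n_rep)              # draws from the prior
    Y = rng.normal(eta[:, None], sigma, size=(n_rep, J))  # simulated data sets
    mu = Y.sum(axis=1) / (J + 1)                          # exact posterior mean
    sd = sd_scale * sigma / np.sqrt(J + 1)                # reported posterior sd
    return np.mean(np.abs(eta - mu) <= z * sd)

exact_cov = coverage(1.0)   # exact posterior
narrow_cov = coverage(0.5)  # deliberately under-dispersed posterior
print(f"exact: {exact_cov:.3f}, under-dispersed: {narrow_cov:.3f}")
```

A well-calibrated posterior should cover close to the nominal 90%, while the artificially narrowed one falls well short; the same comparison applied to damped and undamped estimators would address the referee's question directly.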

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and limitations of our work. We address each major comment below and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / large-scale inverse problem demonstration] Abstract and microscopy demonstration: the central claim that the error-damping estimator addresses stability while preserving correctness is load-bearing, yet the 750,000-parameter inverse problem is presented only as a relevance demonstration without reported metrics (posterior mean error, coverage, or comparison to a subsampled direct baseline) under the actual noise model of the microscopy data.

    Authors: We agree that the microscopy demonstration is presented without quantitative metrics such as posterior mean error or coverage, and that a direct comparison to a subsampled baseline is absent. This section is explicitly framed as a relevance demonstration to show applicability to a real scientific problem at a scale where direct methods become intractable. We will revise the abstract and the demonstration section to more clearly state these limitations and emphasize that statistical accuracy claims rest on the controlled benchmarks and hierarchical AR evaluations rather than the microscopy example. revision: partial

  2. Referee: [Numerical stability verification and hierarchical AR evaluation] Benchmark and AR model sections: the effect of the error-damping strength on the statistical properties of the aggregated posterior (bias, variance, or calibration) is not explicitly quantified, leaving open the possibility that stability gains come at the cost of under-dispersion or systematic bias when diffusion-model approximations encounter non-synthetic likelihoods.

    Authors: We acknowledge that the manuscript does not include an explicit sensitivity analysis varying the error-damping strength and reporting its effects on bias, variance, or calibration. The existing evaluations show numerical stability up to 100,000 points and competitive performance versus direct ABI baselines on the AR model. We will add a targeted analysis (in the main text or supplement) that quantifies these statistical properties across a range of damping strengths on the synthetic benchmarks to address this concern directly. revision: yes
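What such a sensitivity analysis might report can be previewed with a deliberately simplified stand-in for damping. The paper's estimator is not specified on this page, so the sketch assumes a hypothetical constant factor d in (0, 1] that multiplies an exact Gaussian posterior score. In that special case the mean is unchanged and the variance is inflated by 1/d, so the actual coverage of a nominal 90% interval has a closed form.

```python
from statistics import NormalDist

std_normal = NormalDist()
z = std_normal.inv_cdf(0.95)  # half-width multiplier of a nominal 90% interval

def coverage_under_scaling(d):
    # Hypothetical damping: the exact Gaussian score is multiplied by d, so the
    # implied variance is 1/d times the exact one and the mean is unchanged.
    return 2.0 * std_normal.cdf(z / d**0.5) - 1.0

damping_values = [1.0, 0.8, 0.5, 0.25]
report = {d: coverage_under_scaling(d) for d in damping_values}
for d, cov in report.items():
    print(f"d={d:.2f}  variance inflation={1 / d:.2f}  actual coverage={cov:.3f}")
```

In this toy setting stronger damping leaves the posterior mean unbiased but makes intervals conservative. Whether the paper's time-dependent estimator behaves similarly, or instead produces the under-dispersion the referee worries about, is exactly what the promised analysis would need to show.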

Circularity Check

0 steps flagged

No circularity: new estimator introduced and externally verified

full rationale

The paper presents a new error-damping estimator as an extension to compositional score matching (CSM) for handling large numbers of data points in hierarchical amortized Bayesian inference. Numerical stability is checked on a controlled benchmark with up to 100,000 points, performance is compared to direct ABI baselines on a hierarchical AR model, and relevance is shown via demonstration on a 750k-parameter microscopy inverse problem. No derivation step reduces by construction to a fitted quantity, self-citation chain, or renamed input; the central technical contribution is independently motivated and tested against external benchmarks rather than being equivalent to its own assumptions or prior fitted values.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the standard assumption that diffusion models can usefully approximate score functions for compositional Bayesian updating, plus at least one tunable damping parameter whose value is chosen or fitted to achieve stability.

free parameters (1)
  • error-damping strength
    A scalar or schedule that controls how much error is suppressed when aggregating sub-problem scores; its value must be selected to maintain stability on large data collections.
axioms (1)
  • domain assumption: Diffusion models provide sufficiently accurate score estimates for the sub-problems that arise in compositional score matching.
    The entire divide-and-conquer strategy depends on this approximation quality holding when the number of data points reaches 100,000 or more.
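The appendix excerpts in the reference graph mention a variance-preserving forward process and a velocity parameterization that is converted back to a noise prediction. The sketch below shows how those pieces relate, under two assumptions that are not confirmed by this page: a cosine schedule with α_t² + σ_t² = 1, and the standard v-prediction target of Salimans and Ho (2022), defined with the clean sample θ_0.

```python
import numpy as np

rng = np.random.default_rng(3)

def alpha_sigma(t):
    # Assumed cosine schedule with alpha^2 + sigma^2 = 1 (variance-preserving).
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

theta0 = rng.normal(size=100_000)    # clean samples with unit variance
eps = rng.normal(size=theta0.shape)  # Gaussian noise

t = 0.3
alpha, sigma = alpha_sigma(t)
theta_t = alpha * theta0 + sigma * eps  # noised sample at time t

# Velocity target and the two conversions it allows.
v = alpha * eps - sigma * theta0
eps_from_v = sigma * theta_t + alpha * v
theta0_from_v = alpha * theta_t - sigma * v

# The conditional score of theta_t given theta0 equals -eps / sigma.
cond_score = -eps_from_v / sigma

print(round(float(theta_t.var()), 3))
print(np.allclose(eps_from_v, eps), np.allclose(theta0_from_v, theta0))
```

Because the score is the noise prediction divided by -σ_t, any error in the network output is amplified as t approaches 0, which is one reason the accuracy of individual score estimates matters so much once many of them are summed.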

pith-pipeline@v0.9.0 · 5712 in / 1331 out tokens · 43210 ms · 2026-05-22T14:19:23.797919+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Theoretical guidelines for annealed Langevin dynamics in compositional simulation-based inference

    stat.ML 2026-05 unverdicted novelty 7.0

    Derives Wasserstein bounds and explicit hyperparameter tuning rules for annealed Langevin dynamics in compositional score-based SBI, proving Linhart et al. (2026) allows larger steps and fewer total steps than Geffner...

  2. Tokenised Flow Matching for Hierarchical Simulation Based Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    TFMPE combines likelihood factorisation with tokenised flow matching to enable efficient hierarchical SBI from single-site simulations, producing well-calibrated posteriors at lower computational cost on a new benchma...

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 2 Pith papers

  1. [1]

    An amortized approach to non-linear mixed-effects modeling based on neural posterior estimation

    J. Arruda, Y. Schälte, C. Peiter, O. Teplytska, U. Jaehde, and J. Hasenauer. An amortized approach to non-linear mixed-effects modeling based on neural posterior estimation. In International Conference on Machine Learning, pages 1865–1901. PMLR,

  2. [2]

    sbi reloaded: a toolkit for simulation-based inference workflows

    J. Boelts, M. Deistler, M. Gloeckler, Á. Tejero-Cantero, J.-M. Lueckmann, G. Moss, P. Steinbach, T. Moreau, F. Muratore, J. Linhart, et al. sbi reloaded: a toolkit for simulation-based inference workflows. Journal of Open Source Software, 10(108):7754,

  3. [3]

    Stan: A probabilistic programming language

    [Page header: Published as a conference paper at ICLR 2026] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76:1–32,

  4. [4]

    Bayesian workflow

    A. Gelman, A. Vehtari, D. Simpson, C. C. Margossian, B. Carpenter, Y. Yao, L. Kennedy, J. Gabry, P.-C. Bürkner, and M. Modrák. Bayesian workflow. arXiv preprint arXiv:2011.01808,

  5. [5]

    D. Habermann, M. Schmitt, L. Kühmichel, A. Bulling, S. T. Radev, and P.-C. Bürkner. Amortized Bayesian multilevel models. CoRR, abs/2408.13230,

  6. [6]

    Gotta go fast when generating data with score-based models

    A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas. Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080,

  7. [7]

    Z. Li, H. Yuan, K. Huang, C. Ni, Y. Ye, M. Chen, and M. Wang. Diffusion model for data-driven black-box optimization. arXiv preprint arXiv:2403.13219,

  8. [8]

    Diffusion posterior sampling for simulation-based inference in tall data settings

    J. Linhart, G. Cardoso, A. Gramfort, S. L. Corff, and P. L. C. Rodrigues. Diffusion posterior sampling for simulation-based inference in tall data settings. Transactions on Machine Learning Research,

  9. [9]

    J. T. Smith, R. Yao, N. Sinsuebphon, A. Rudkouskaya, N. Un, J. Mazurkiewicz, M. Barroso, P. Yan, and X. Intes. Fast fit-free analysis of fluorescence lifetime imaging via deep learning. Proceedings of the National Academy of Sciences, 116(48):24019–24030,

  10. [10]

    M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. Advances in Neural Information Processing Systems, 30,

  11. [11]

    Solving stochastic inverse problems with stochastic BayesFlow

    Y. Zhang and L. Mikelsons. Solving stochastic inverse problems with stochastic BayesFlow. In 2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), pages 966–972,

  12. [12]

    Song et al. (2021b): $\mathrm{d}\theta_t = f(\theta_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t$

    A Appendix. A.1 Stochastic differential equation formulation of the diffusion process. The forward diffusion process for $t \in [0, 1]$ can be specified as a stochastic differential equation (Song et al., 2021b): $\mathrm{d}\theta_t = f(\theta_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t$. For a known variance-preserving process, the drift and diffusion coefficients...

  13. [13]

    • Time series summary network: For structured input data such as time series (as in the FLI application), we use a hybrid convolutional–recurrent architecture

    and ReLU activations, projecting to the final output dimension. • Time series summary network: For structured input data such as time series (as in the FLI application), we use a hybrid convolutional–recurrent architecture. The model begins with a stack of 1D convolutional layers followed by a skipping recurrent path, as implemented in (Zhang and Mikelsons...

  14. [14]

    We parameterize our score models to predict the more stable velocity $\hat{v}_t := \alpha_t \epsilon - \sigma_t \theta_t$, and then transform the output to noise $\hat{\epsilon}_t$, as it has been shown that this parameterization is more stable for all $t$, whereas noise-prediction becomes harder for $t$ close to 0 where the signal increases and noise decreases (Salimans and Ho, 2022). Furthermore, we conditio...

  15. [15]

    We observe $\{Y_j\}_{j=1}^{J}$ with varying $J$ and compute the posterior $p(\eta \mid \{Y_j\}_{j=1}^{J})$. Given a normal prior for $\eta$, $\eta \sim \mathcal{N}(0, \sigma^2 I)$, the posterior is also Gaussian, and we can calculate it analytically: $p(\eta \mid \{Y_j\}_{j=1}^{J}) \propto \exp\left(-\tfrac{1}{2} (\eta - \mu_J)^\top \Sigma_J^{-1} (\eta - \mu_J)\right)$, where $\mu_J = \frac{1}{J+1} \sum_{j=1}^{J} Y_j$ and $\Sigma_J^{-1} = \frac{J+1}{\sigma^2} I$. Here, we did not employ a summary network. ...

  16. [16]

    We used 4 parallel chains, each generating 1,000 samples with default settings in Stan

    performs better on non-centered parameterizations (Betancourt and Girolami, 2015). We used 4 parallel chains, each generating 1,000 samples with default settings in Stan. Here, we do not employ a summary network. For the direct hierarchical ABI methods (Heinrich et al., 2024; Habermann et al., 2024), we employ • ABI-NF: Normalizing flow with 2 coupling la...

  17. [17]

    Convergence is achieved only for the smallest dataset

    [Plot residue. Panels plotted against Data Size: Number of Steps (with Max Steps marker), RMSE Global, Calibration Error Global, Contraction Global, RMSE Local, Contraction Local. Legend values: 1, 10, 16, 100, 256, 4096, 16384.] (b) Varying mini-batch sizes. Convergen...

  18. [18]

    The real data were also normalized to 1 on a pixel-wise level. Instrument response function (IRF). The emitted signals are recorded using multiple instruments (detectors, electronics, etc.) which have a characteristic response E(t) to an instantaneous signal δ(t) (e.g., a single photon). The recorded signals from the T-periodic emitted signal can be writte...

  19. [19]

    Here, we employed a time-series summary network. For comparison, we also trained a diffusion model of the same size as ours on the flat model using the same prior and simulation budget, but only targeting the local per pixel parameters without conditioning on global parameters. Data. AU565 (HER2+ human breast carcinoma) cells, incubated for 24h with 20 µg/m...