How to Approximate Inference with Subtractive Mixture Models

Antonio Vergari; Lena Zellinger; Lennert De Smet; Nicola Branchini; Nikolay Malkin; Víctor Elvira

arXiv: 2604.16714 · v1 · submitted 2026-04-17 · cs.LG · stat.CO · stat.ML

Lena Zellinger, Nicola Branchini, Lennert De Smet, Víctor Elvira, Nikolay Malkin, Antonio Vergari

Pith reviewed 2026-05-10 08:16 UTC · model grok-4.3

classification: cs.LG, stat.CO, stat.ML

keywords: subtractive mixture models, variational inference, importance sampling, approximate inference, expectation estimation, distribution approximation, negative coefficients

The pith

Subtractive mixture models can be used for variational inference and importance sampling by designing special expectation estimators and learning schemes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how subtractive mixture models, which allow negative coefficients unlike classical mixtures, can still serve as proposals in approximate inference. Without latent variables, standard sampling does not apply, so the authors develop custom estimators for computing expectations in importance sampling and optimization schemes for variational inference. They test these approaches on distribution approximation tasks and outline fixes for stability and efficiency problems that arise. If successful, this would make SMMs a viable, more flexible alternative to positive-coefficient mixtures in inference pipelines.

Core claim

Subtractive mixture models become practical for variational inference and importance sampling once several tailored expectation estimators and learning schemes are introduced to handle the lack of latent-variable semantics.

What carries the argument

Expectation estimators for IS and learning schemes for VI that operate directly on SMM parameters without requiring latent-variable sampling.

If this is right

SMMs can serve as more expressive proposals than classical mixtures in both IS and VI.
The new estimators allow unbiased or reduced-variance expectation estimates under negative coefficients.
Learning procedures enable direct optimization of SMM parameters for variational objectives.
Proposed fixes address instability in estimation and slower convergence during learning.
Empirical results on distribution approximation demonstrate practical feasibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These techniques might extend to other inference settings that currently avoid negative-weight models, such as certain particle filters.
Combining SMMs with gradient-based sampling could further reduce the computational overhead of the new estimators.
Theoretical analysis of approximation error bounds under the proposed schemes would strengthen guarantees for high-dimensional use.
The stability fixes might generalize to other mixture variants that include sign changes.

Load-bearing premise

Suitably designed estimators and schemes can produce stable, efficient approximations even though SMMs lack the latent-variable structure that classical mixtures use for sampling.

What would settle it

A controlled experiment on standard benchmarks where the proposed SMM estimators produce consistently higher variance or divergent learning compared to classical mixture baselines would show the methods do not overcome the missing latent semantics.

Figures

Figures reproduced from arXiv: 2604.16714 by Antonio Vergari, Lena Zellinger, Lennert De Smet, Nicola Branchini, Nikolay Malkin, Víctor Elvira.

**Figure 1.** SMMs are effective variational families for targets with disconnected support such as trajectories over the walkable area from the Stanford drone dataset (Robicquet et al., 2016). With just K = 2 components, the SMM learns the absence of density at the central roundabout while a GMM requires K > 2. We lay the foundation to use SMMs for VI. … for every input x ∈ ℝ^D, where q_k are the mixture components and α_k…
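To make the idea of "learning the absence of density" with K = 2 concrete, here is a minimal one-dimensional sketch of a subtractive mixture. The component parameters, the weight `beta`, and the function names are illustrative assumptions and are not taken from the paper. A wide Gaussian carries weight 1 + beta and a narrow Gaussian carries weight -beta, so the coefficients sum to one while the density is carved out around the origin.

```python
import math


def normal_pdf(x: float, mean: float, std: float) -> float:
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def smm_pdf(x: float, beta: float = 1.0) -> float:
    """Toy SMM: q(x) = (1 + beta) * N(x; 0, 2^2) - beta * N(x; 0, 1^2).

    The coefficients (1 + beta) and -beta sum to 1, so q integrates to 1.
    For 0 <= beta <= 1 the density stays non-negative, because
    N(x; 0, 1^2) / N(x; 0, 2^2) = 2 * exp(-3 x^2 / 8) <= 2.
    """
    return (1.0 + beta) * normal_pdf(x, 0.0, 2.0) - beta * normal_pdf(x, 0.0, 1.0)


# Numerical sanity checks on a grid over [-12, 12].
step = 0.01
grid = [-12.0 + step * i for i in range(2401)]
values = [smm_pdf(x) for x in grid]
total_mass = step * sum(values)  # Riemann-sum approximation of the integral
min_value = min(values)          # should not be negative
print(f"mass ~ {total_mass:.6f}, min = {min_value:.3e}, q(0) = {smm_pdf(0.0):.3e}")
```

With `beta = 1` the density vanishes exactly at the origin, which is the one-dimensional analogue of the hole at the roundabout; a classical mixture with non-negative weights cannot produce an interior zero with Gaussian components.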

**Figure 2.** Visual comparison of sampling strategies on a 2D SMM. ARITS directly simulates samples from the ring. Rejection sampling discards many samples since the average acceptance probability in this example is only around 0.137. ∆IS uses samples from both positively and negatively weighted components (depicted in blue and red, respectively) to estimate a difference of expectations. All methods are depicted with S …
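Two of the strategies in this caption can be sketched on a toy one-dimensional SMM. The block below is a hedged illustration, not the paper's implementation: the density q(x) = (1 + β)·N(x; 0, 2²) − β·N(x; 0, 1²), the value β = 1, and all function names are assumptions. Rejection sampling proposes from the positive component and accepts with probability q(x) / ((1 + β)·q₊(x)); the difference estimator is a plain Monte Carlo version of the "difference of expectations" idea behind ∆IS, and ARITS is not covered.

```python
import math
import random

BETA = 1.0  # weight of the subtracted component (assumed value)


def normal_pdf(x: float, mean: float, std: float) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def smm_pdf(x: float) -> float:
    """Toy SMM: (1 + BETA) * N(x; 0, 2^2) - BETA * N(x; 0, 1^2)."""
    return (1.0 + BETA) * normal_pdf(x, 0.0, 2.0) - BETA * normal_pdf(x, 0.0, 1.0)


def rejection_sample(n_proposals: int, rng: random.Random) -> list:
    """Propose from the positive component and keep x with prob q(x) / (M * q_plus(x)).

    Since q(x) <= (1 + BETA) * q_plus(x), the bound M = 1 + BETA is valid and
    the average acceptance probability is 1 / M.
    """
    accepted = []
    for _ in range(n_proposals):
        x = rng.gauss(0.0, 2.0)
        accept_prob = smm_pdf(x) / ((1.0 + BETA) * normal_pdf(x, 0.0, 2.0))
        if rng.random() < accept_prob:
            accepted.append(x)
    return accepted


def difference_estimate(f, n_pos: int, n_neg: int, rng: random.Random) -> float:
    """Estimate E_q[f] as (1 + BETA) * E_{q_plus}[f] - BETA * E_{q_minus}[f]."""
    pos = sum(f(rng.gauss(0.0, 2.0)) for _ in range(n_pos)) / n_pos
    neg = sum(f(rng.gauss(0.0, 1.0)) for _ in range(n_neg)) / n_neg
    return (1.0 + BETA) * pos - BETA * neg


rng = random.Random(0)
n_proposals = 40_000
samples = rejection_sample(n_proposals, rng)
acceptance_rate = len(samples) / n_proposals
rs_estimate = sum(x * x for x in samples) / len(samples)
diff_estimate = difference_estimate(lambda x: x * x, 20_000, 20_000, rng)
print(acceptance_rate, rs_estimate, diff_estimate)
```

For this toy density the exact value of E_q[x²] is 2·4 − 1·1 = 7 and the expected acceptance rate is 1 / (1 + β) = 0.5, so both estimators can be checked against a known answer. The difference estimator never discards samples, but it subtracts two noisy terms, which is where the variance concerns discussed elsewhere on this page come from.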

**Figure 3.** Rejection sampling and ∆IS can achieve comparable estimation quality to ARITS when given sufficient sampling budget, but can be orders of magnitude faster in high dimensions as shown for MC estimation. We depict (mean ± stddev) over 30 instances. Details in §C.2. 6 Experiments: We now empirically assess how the estimators discussed in this paper perform in three settings: vanilla MC, BBVI, and IS with le…

**Figure 4.** ∆VI requires a higher number of samples than RLOO variants to achieve comparable RKL and FKL. The RKL and FKL values were collected from 10 models learned with a budget of S samples per step. … which are very common benchmarks in statistics and VI. In particular, for BLR we use the datasets GermanCredit, BreastCancer, Ionosphere and Sonar from Blessing et al. (2024). In Tab. 3, we report estimated ELBOs fo…
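For readers unfamiliar with the RLOO baseline named in this caption, the following is a generic sketch of a leave-one-out score-function gradient estimator. It is not the paper's RLOO variant for SMMs: the variational family here is a unit-variance Gaussian with learnable mean `mu`, and the function name and test integrand are assumptions chosen so that the exact gradient is known.

```python
import random


def rloo_gradient(f, mu: float, n_samples: int, rng: random.Random) -> float:
    """Leave-one-out score-function estimate of d/dmu E_{x ~ N(mu, 1)}[f(x)].

    Each sample's baseline is the mean of f over the other samples, which
    reduces variance while keeping the estimator unbiased.
    """
    xs = [rng.gauss(mu, 1.0) for _ in range(n_samples)]
    fs = [f(x) for x in xs]
    total = sum(fs)
    grad = 0.0
    for x, fx in zip(xs, fs):
        baseline = (total - fx) / (n_samples - 1)  # excludes the current sample
        score = x - mu                             # d/dmu log N(x; mu, 1)
        grad += (fx - baseline) * score
    return grad / n_samples


rng = random.Random(1)
mu = 1.5
estimate = rloo_gradient(lambda x: x * x, mu, 20_000, rng)
exact = 2.0 * mu  # since E[x^2] = mu^2 + 1 under N(mu, 1)
print(estimate, exact)
```

The same recipe applies to any family whose log-density gradient is available; what changes for an SMM is how the samples are obtained in the first place.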

**Figure 6.** A second scenario from the SDD dataset. SMMs (top) and GMMs (bottom) result in very different fits to the target for the same component budget. We once again observe that for K = 2, the SMM effectively learns to use subtraction in order to model the absence of density induced by a constraint. … pipeline and use learned SMM proposals for normalizing constant estimation. We use proposals learned via rejectio…

**Figure 7.** A squared mixture can be split into its positive and negative parts as illustrated via its representation as a computational graph, also called circuit (Choi et al., 2020; Loconte et al., 2025a). A Sampling Algorithms: In this section, we provide further details on the sampling algorithms discussed throughout the paper. Alg. 3 provides the algorithm for ancestral mixture sampling. Alg. 4 covers stratified s…
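The split mentioned in this caption can be written out explicitly. The derivation below is a sketch under the assumption that the squared mixture has the form of a normalized square of a signed combination of component densities; the index sets 𝒫 and 𝒩 and the symbols Z₊, Z₋, q₊, q₋ are notation introduced here for illustration.

```latex
\begin{align*}
q(x) &= \frac{1}{Z}\Big(\sum_{k=1}^{K} \alpha_k\, q_k(x)\Big)^{2}
      = \frac{1}{Z}\sum_{i=1}^{K}\sum_{j=1}^{K} \alpha_i \alpha_j\, q_i(x)\, q_j(x), \\
\mathcal{P} &= \{(i,j) : \alpha_i \alpha_j > 0\}, \qquad
\mathcal{N} = \{(i,j) : \alpha_i \alpha_j < 0\}, \\
Z_{+} &= \sum_{(i,j)\in\mathcal{P}} \alpha_i \alpha_j \int q_i(x)\, q_j(x)\, dx, \qquad
Z_{-} = \sum_{(i,j)\in\mathcal{N}} |\alpha_i \alpha_j| \int q_i(x)\, q_j(x)\, dx, \\
q_{+}(x) &= \frac{1}{Z_{+}} \sum_{(i,j)\in\mathcal{P}} \alpha_i \alpha_j\, q_i(x)\, q_j(x), \qquad
q_{-}(x) = \frac{1}{Z_{-}} \sum_{(i,j)\in\mathcal{N}} |\alpha_i \alpha_j|\, q_i(x)\, q_j(x), \\
q(x) &= \frac{Z_{+}}{Z}\, q_{+}(x) - \frac{Z_{-}}{Z}\, q_{-}(x), \qquad Z = Z_{+} - Z_{-}.
\end{align*}
```

Both q₊ and q₋ are classical mixtures over products of components with non-negative weights, so each can be sampled with standard schemes even though their signed combination cannot.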

**Figure 8.** γ(a, S) versus S (log-scale x-axis) for several values of a, the acceptance probability of rejection. The best convergence rate is 1/S, which is almost achieved when a is close to 1. Since E_{x∼q_SMM}[Î_RS | K] = I for any K, the second term is zero, so expanding the first term (using that accepted samples are i.i.d.), V[Î_RS] = V_{x∼q_SMM}[h(x)] · E[1/K]. In the following, we set γ(S, a) := E[1/K] under the de…
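The factor γ(S, a) = E[1/K] can be evaluated numerically to see the behaviour described in the caption. The caption is truncated before it states the distribution of K, so the sketch below makes an explicit assumption: K, the number of accepted samples out of S proposals, is Binomial(S, a) conditioned on K ≥ 1. The function name is hypothetical.

```python
import math


def gamma_rs(S: int, a: float) -> float:
    """E[1/K] for K ~ Binomial(S, a) conditioned on K >= 1 (assumed model)."""
    if S < 1 or not 0.0 < a <= 1.0:
        raise ValueError("need S >= 1 and 0 < a <= 1")
    if a == 1.0:
        return 1.0 / S  # every proposal is accepted, so K = S
    log_a, log_1ma = math.log(a), math.log1p(-a)

    def log_pmf(k: int) -> float:
        # Log of the Binomial(S, a) probability mass at k, computed stably.
        return (math.lgamma(S + 1) - math.lgamma(k + 1) - math.lgamma(S - k + 1)
                + k * log_a + (S - k) * log_1ma)

    p_zero = math.exp(S * log_1ma)  # probability that nothing is accepted
    numerator = sum(math.exp(log_pmf(k)) / k for k in range(1, S + 1))
    return numerator / (1.0 - p_zero)


for a in (0.137, 0.5, 0.9, 1.0):
    # S * gamma is 1 at the ideal 1/S rate and grows as acceptance drops.
    print(a, [round(S * gamma_rs(S, a), 3) for S in (10, 100, 1000)])
```

Under this assumption the product S·γ(S, a) tends towards 1/a for large S, which matches the caption's remark that the 1/S rate is almost achieved when a is close to 1.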

**Figure 9.** Visual comparisons of models obtained with …

**Figure 10.** The quality of the variational approximation can be sensitive to the initialization. The figure shows learned models for 5 different initializations for the Hollow target in 16 dimensions. The components are sketched as ellipses with width and height corresponding to their standard deviations. Positively weighted components are illustrated in black and negative components are shown in red. The first colum…

**Figure 11.** A safe component can mitigate the potentially large variance of ∆IS. Depicted is the estimation error, log(|Î − I|) − log(I), averaged over 100 repeated estimations (mean ± stddev) for various sampling budgets S. Lower is better. Without a safe component (i.e., β = 0), ∆IS can result in high variance and the average estimation error gets worse as the sampling budget increases. Standard UIS estimators do …

**Figure 12.** The trends observed in Tab. 11 hold across various sample sizes. The boxplots summarize the error for normalizing constant estimation over 100 repetitions when using GMM and SMM proposals for varying sampling budgets. Lower is better. For targets on which the SMM achieves a better fit, we see better estimation performance when using the SMM proposal with either rejection or ARITS. On the remaining targets…

**Figure 13.** Some unnormalized bivariate conditionals for the GermanCredit (first row), BreastCancer (second …

**Figure 14.** Additional results for the SDD targets. GMMs are shown in the top row, SMMs in the bottom row.

Original abstract

Classical mixture models (MMs) are widely used tractable proposals for approximate inference settings such as variational inference (VI) and importance sampling (IS). Recently, mixture models with negative coefficients, called subtractive mixture models (SMMs), have been proposed as a potentially more expressive alternative. However, how to effectively use SMMs for VI and IS is still an open question as they do not provide latent variable semantics and therefore cannot use sampling schemes for classical MMs. In this work, we study how to circumvent this issue by designing several expectation estimators for IS and learning schemes for VI with SMMs, and we empirically evaluate them for distribution approximation. Finally, we discuss the additional challenges in estimation stability and learning efficiency that they carry and propose ways to overcome them. Code is available at: https://github.com/april-tools/delta-vi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives practical estimators and schemes for using subtractive mixture models in importance sampling and variational inference, but evaluates them only on direct density approximation rather than full inference tasks. Subtractive mixtures allow negative coefficients for more flexibility than standard mixtures, yet they lack the latent variables needed for easy sampling. The authors design several expectation estimators for IS and corresponding VI learning procedures to work around that limitation, then run experiments fitting SMMs to target densities. They also flag stability and efficiency issues and suggest mitigations, with code released publicly. That part is straightforward and addresses a real gap left by earlier SMM proposals. The main limitation is the evaluation scope. The tests stay at approximating simple distributions instead of measuring low-variance IS estimates or stable VI performance on harder targets like posteriors. Without those downstream results, it is hard to know whether the new estimators deliver usable gains in actual approximate inference pipelines. The abstract itself notes the extra challenges, so this is not an overclaim but an incomplete picture. Readers working on flexible proposals for probabilistic inference will find the concrete designs useful to build on. The work engages the literature honestly without circular arguments or invented tricks. I would bring it to a reading group to walk through the estimator derivations. It deserves peer review because it supplies missing tools for an existing model class, though referees should push for expanded experiments on end-to-end inference metrics.

Referee Report

2 major / 2 minor

Summary. The paper addresses the challenge of using subtractive mixture models (SMMs) for approximate inference in variational inference (VI) and importance sampling (IS). Unlike classical mixture models, SMMs lack latent-variable semantics and thus cannot rely on standard sampling schemes. The authors propose several expectation estimators for IS and learning schemes for VI with SMMs, empirically evaluate these on distribution approximation tasks, and discuss associated challenges in estimation stability and learning efficiency along with mitigation strategies. Code is provided for reproducibility.

Significance. If the proposed estimators and schemes can be shown to deliver stable, low-variance inference on realistic targets, the work would meaningfully expand the set of tractable yet expressive proposal distributions available for VI and IS. The explicit treatment of stability and efficiency issues, together with open-source code, strengthens the contribution's practical value.

major comments (2)

[Abstract, §5] Abstract and empirical evaluation section: the central claim is that the designed estimators and learning schemes make SMMs 'usable' for VI and IS. However, the reported experiments are restricted to distribution approximation of simple target densities. No results are shown on downstream tasks such as posterior approximation, low-variance IS weight estimation, or convergence of VI on latent-variable models, leaving the usability claim untested.
[§3] §3 (expectation estimators for IS): the paper notes 'additional challenges in estimation stability' but does not provide quantitative comparisons (e.g., effective sample size or variance of the estimators) against standard MM baselines on the same targets, making it difficult to judge whether the new estimators actually circumvent the lack of latent-variable semantics.
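The effective sample size requested in the [§3] comment is commonly computed with Kish's formula, ESS = (Σ w)² / Σ w². The sketch below is a generic helper, not something taken from the paper; it assumes non-negative importance weights supplied in log space (as with standard IS or rejection-based proposals), and the function name is hypothetical. Estimators with signed weights would need a different diagnostic.

```python
import math


def effective_sample_size(log_weights: list) -> float:
    """Kish effective sample size from unnormalized log importance weights."""
    if not log_weights:
        raise ValueError("need at least one weight")
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2


uniform_ess = effective_sample_size([0.0] * 10)                # all weights equal
degenerate_ess = effective_sample_size([0.0, -1000.0, -1000.0])  # one weight dominates
print(uniform_ess, degenerate_ess)
```

Equal weights give an ESS equal to the number of samples, while a single dominant weight drives it towards one, so reporting ESS alongside variance would make comparisons against classical mixture baselines easy to read.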

minor comments (2)

[§2] Notation for the subtractive coefficients and the resulting density could be clarified with an explicit normalization step or a short derivation showing how the mixture remains a valid density.
[§4] The discussion of learning schemes for VI would benefit from a concise pseudocode listing the gradient estimator and any variance-reduction techniques employed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's potential significance. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: [Abstract, §5] Abstract and empirical evaluation section: the central claim is that the designed estimators and learning schemes make SMMs 'usable' for VI and IS. However, the reported experiments are restricted to distribution approximation of simple target densities. No results are shown on downstream tasks such as posterior approximation, low-variance IS weight estimation, or convergence of VI on latent-variable models, leaving the usability claim untested.

Authors: We agree that the experiments focus exclusively on distribution approximation of simple targets and do not demonstrate performance on downstream tasks such as posterior approximation or VI convergence on latent-variable models. The manuscript's abstract and §5 explicitly frame the evaluation as testing the estimators and learning schemes for distribution approximation, which directly validates the core mechanisms for handling the absence of latent-variable semantics. This was intended as a foundational step before more complex applications. To strengthen the usability claim, we will revise the manuscript by adding a new experiment on a simple posterior approximation task using SMMs within a VI framework, reporting convergence behavior and IS weight variance. We will also update the abstract to more precisely describe the scope of the current empirical results. revision: yes
Referee: [§3] §3 (expectation estimators for IS): the paper notes 'additional challenges in estimation stability' but does not provide quantitative comparisons (e.g., effective sample size or variance of the estimators) against standard MM baselines on the same targets, making it difficult to judge whether the new estimators actually circumvent the lack of latent-variable semantics.

Authors: The referee correctly identifies that while stability challenges are noted in §3, the manuscript lacks direct quantitative comparisons (such as estimator variance or effective sample size) against standard mixture model baselines on identical targets. Such metrics would better illustrate whether the proposed estimators address the limitations from missing latent-variable semantics. We will revise §3 and the empirical section to include these comparisons, adding tables and plots of variance and ESS for our estimators versus MM baselines on the same target densities used in the paper. revision: yes

Circularity Check

0 steps flagged

No circularity detected; estimators and schemes are independently designed

full rationale

The paper proposes new expectation estimators for IS and learning schemes for VI tailored to SMMs, then evaluates them empirically on distribution approximation tasks. No equations, derivations, or self-citations in the abstract or described content reduce any claimed result to a fitted parameter or prior input by construction. The approach extends standard VI/IS techniques to a new model class without redefining inputs in terms of outputs or relying on load-bearing self-citations for uniqueness. The derivation chain remains self-contained and externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone; no explicit free parameters, axioms, or invented entities are described. The work relies on standard assumptions of approximate inference but does not detail them here.

pith-pipeline@v0.9.0 · 5462 in / 1036 out tokens · 35013 ms · 2026-05-10T08:16:44.278484+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

URL https://doi.org/10.1093/imamat/8.1.80

doi: 10.1093/imamat/8.1.80. URL https://doi.org/10.1093/imamat/8.1.80. Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006. Denis Blessing, Xiaogang Jia, Johannes Esslinger, Francisco Vargas, and Gerhard Neumann. Beyond ELBOs: A large-scale evaluation of variational methods for sampling. In For...

work page doi:10.1093/imamat/8.1.80 2006
[2]

Nicola Branchini and Víctor Elvira

URL https://openreview.net/forum?id=fVg9YrSllr. Nicola Branchini and Víctor Elvira. An adaptive mixture view of particle filters. Foundations of Data Science, 7(4), 2025. doi: 10.3934/fods.2024017. Monica F Bugallo, Victor Elvira, Luca Martino, David Luengo, Joaquin Miguez, and Petar M Djuric. Adaptive importance sampling: The past, the present, and th...

work page doi:10.3934/fods.2024017 2025
[3]

Statistical Science , author =

doi: 10.1214/18-STS668. Matteo Fasiolo, Flávio Eler de Melo, and Simon Maskell. Langevin incremental mixture importance sampling. Statistics and Computing, 28(3):549–561, 2018. Axel Finke and Alexandre H Thiery. On importance-weighted autoencoders. 2019. URL https://arxiv.org/abs/1907.10477. Michael B Giles. Multilevel Monte Carlo methods. Acta Numerica,...

work page doi:10.1214/18-sts668 2018
[5]

Oskar Kviman, Harald Melin, Hazal Koptagel, Victor Elvira, and Jens Lagergren

URL https://arxiv.org/abs/2503.19466. Oskar Kviman, Harald Melin, Hazal Koptagel, Victor Elvira, and Jens Lagergren. Multiple importance sampling ELBO and deep ensembles of variational approximations. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS ...

work page arXiv 2022
[6]

Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák

URL http://proceedings.mlr.press/v130/morningstar21b.html. Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. ACM Transactions on Graphics (ToG), 38(5):1–19, 2019. Radford M Neal. Slice sampling. The Annals of Statistics, 31(3):705–767, 2003. Art Owen and Yi Zhou. Safe and effective importance ...

work page arXiv 2019
[7]

Baibo Zhang and Changshui Zhang

URL http://proceedings.mlr.press/v80/yao18a.html. Baibo Zhang and Changshui Zhang. Finite mixture models with negative components. In 4th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), pages 31–

work page
[8]

Checklist

Springer, 2005. Checklist

work page 2005
[9]

Yes, provided throughout the paper, primarily §2 and §3, §A, §B

For all models and algorithms presented, check if you include: (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. Yes, provided throughout the paper, primarily §2 and §3, §A, §B. (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. Yes, see §A.5. (c) (Optional) Anonymized sour...

work page
[10]

Yes, see Theorem 1, Prop

For any theoretical claim, check if you include: (a) Statements of the full set of assumptions of all theoretical results. Yes, see Theorem 1, Prop. 1, and §B. (b) Complete proofs of all theoretical results. Yes, see §B. (c) Clear explanations of any assumptions. Yes, see Theorem 1, Prop. 1, and §B

work page
[11]

Yes, see https://github.com/april-tools/delta-vi

For all figures and tables that present empirical results, check if you include: (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). Yes, see https://github.com/april-tools/delta-vi. (b) All the training details (e.g., data splits, hyperparameters, how they were chosen...

work page
[12]

Yes, see §C

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include: (a) Citations of the creator if your work uses existing assets. Yes, see §C. (b) The license information of the assets, if applicable. Yes, see §C. (c) New assets either in the supplemental material or as a URL, if applicable. Yes, see http...

work page
[13]

f(x(s) + ) p(x(s) + ) q(x(s) + ) # − Z− Z 1 S− S−X s=1 Eq−

If you used crowdsourcing or conducted research with human subjects, check if you include: (a) The full text of instructions given to participants and screenshots. Not Applicable. (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. Not Applicable. (c) The estimated hourly wage paid t...

work page 2020
