Interpretable epistemic uncertainty decomposition in sequential generative models via polynomial chaos surrogates

Dongsung Huh; Lior Horesh; Małgorzata J Zimoń; Ramón Nartallo-Kaluarachchi; Robert Manson-Sawko; Shashanka Ubaru; Yoshua Bengio

arxiv: 2510.21523 · v2 · pith:7JMBY4MRnew · submitted 2025-10-24 · 💻 cs.LG · stat.ML

Ramón Nartallo-Kaluarachchi, Shashanka Ubaru, Małgorzata J Zimoń, Dongsung Huh, Robert Manson-Sawko, Lior Horesh, Yoshua Bengio

Pith reviewed 2026-05-21 19:49 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords epistemic uncertainty · polynomial chaos expansion · GFlowNets · Sobol sensitivity indices · sequential generative models · interpretable uncertainty · reward decomposition

The pith

Fitting polynomial chaos expansions to small GFlowNet ensembles yields analytical Sobol indices that decompose epistemic uncertainty by reward component.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to propagate uncertainty from imperfect reward estimates through sequential generative models by training small ensembles of GFlowNets and fitting polynomial chaos expansions to their outputs. The expansion coefficients then supply closed-form Sobol sensitivity indices that attribute generative decisions to specific reward terms. This decomposition is unavailable from standard uncertainty methods like ensembles or dropout. In practice the indices expose which design choices remain stable and which shift sharply when reward estimates vary, turning opaque uncertainty into targeted guidance for scientific tasks such as catalyst screening and molecular design.

Core claim

By fitting polynomial chaos expansions to small ensembles of trained GFlowNets, the resulting coefficients deliver analytical Sobol sensitivity indices that decompose the epistemic uncertainty inherited from uncertain rewards into contributions from individual reward components, with theoretical convergence guarantees and empirical calibration coverage of 0.97-1.00 at the 95 percent level across the dominant generative steps.
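The calibration figure quoted above is an empirical coverage: the fraction of held-out values that fall inside a nominal 95 percent interval. The following is a minimal sketch of that calculation, not the paper's procedure. The function names `percentile_interval` and `empirical_coverage`, the use of percentile intervals over surrogate samples, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def percentile_interval(samples, level=0.95):
    """Central interval per column from samples of shape (n_samples, n_points)."""
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(samples, [tail, 100.0 - tail], axis=0)
    return lower, upper


def empirical_coverage(lower, upper, observed):
    """Fraction of observed values that lie inside [lower, upper]."""
    lower, upper, observed = map(np.asarray, (lower, upper, observed))
    inside = (observed >= lower) & (observed <= upper)
    return float(inside.mean())


# Toy numbers (illustrative only, not from the paper): four held-out values,
# three of which fall inside their intervals.
lower = np.array([-1.0, 0.0, 2.5, 2.0])
upper = np.array([1.0, 2.0, 3.0, 4.0])
observed = np.array([0.0, 1.0, 2.0, 3.0])
coverage = empirical_coverage(lower, upper, observed)
```

A coverage close to the nominal level suggests the intervals are neither too narrow nor needlessly wide; values above it, such as the reported 0.97-1.00, point to somewhat conservative intervals.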

What carries the argument

Polynomial chaos expansions fitted to model ensembles, whose coefficients directly compute Sobol sensitivity indices that quantify the influence of each reward component on downstream generative choices.
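For readers unfamiliar with how sensitivity indices fall out of expansion coefficients, the standard identities for an orthonormal polynomial chaos basis are sketched below. The notation is assumed for illustration and is not taken from the paper: $\xi = (\xi_1, \dots, \xi_d)$ stands for independent standardized reward components, $f$ for a scalar output of the generative policy, and $\mathcal{A}$ for the retained set of multi-indices.

```latex
% Assumed notation, not the paper's: orthonormal basis \Psi_\alpha,
% coefficients c_\alpha, truncation set \mathcal{A}.
\begin{align}
  f(\xi) &\approx \sum_{\alpha \in \mathcal{A}} c_\alpha \Psi_\alpha(\xi),
  \qquad \mathbb{E}[\Psi_\alpha \Psi_\beta] = \delta_{\alpha\beta}, \\
  \operatorname{Var}[f] &\approx \sum_{\alpha \in \mathcal{A},\ \alpha \neq 0} c_\alpha^2, \\
  S_i &\approx \frac{\sum_{\alpha \in \mathcal{A}_i} c_\alpha^2}{\operatorname{Var}[f]},
  \qquad \mathcal{A}_i = \{\alpha \in \mathcal{A} : \alpha_i > 0,\ \alpha_j = 0 \ \forall j \neq i\}, \\
  S_i^{T} &\approx \frac{\sum_{\alpha \in \mathcal{A},\ \alpha_i > 0} c_\alpha^2}{\operatorname{Var}[f]}.
\end{align}
```

Because the variance splits into a sum of squared coefficients, the first-order index $S_i$ and total index $S_i^{T}$ require no further sampling once the coefficients are fitted. Any truncation or fitting error in $c_\alpha$ carries over directly into the indices.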

If this is right

Catalyst selection on the Buchwald-Hartwig dataset remains robust ($D_{\mathrm{catalyst}}\approx 71$), while additive selection is about 2.5 times as sensitive to reward uncertainty ($D_{\mathrm{additive}}\approx 179$).
In fragment-based molecular design the linker position emerges as the most sensitive element, reversing the usual scaffold-robust versus decoration-fragile pattern.
On the Sachs protein network, MAPK-cascade edges and PKA/PKC hub edges fall into distinct sensitivity regimes that can guide targeted perturbation experiments.
The surrogate evaluates ten thousand policy samples in milliseconds, three to four orders of magnitude faster than exhaustive retraining.
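The speed claim in the last item rests on the surrogate being a plain polynomial: once fitted, evaluating it is a matrix-vector product. The sketch below fits a small expansion by least squares, reads Sobol indices off the coefficients, and evaluates 10,000 new inputs. It is an illustration under stated assumptions, not the paper's implementation: inputs are taken as uniform on $[-1, 1]$ with a Legendre basis, `toy_policy_output` is a hypothetical stand-in for one scalar output of a trained GFlowNet, and the ensemble size of 20 and degree 2 are arbitrary choices.

```python
import itertools
import numpy as np
from numpy.polynomial import legendre


def multi_indices(dim, degree):
    """All multi-indices alpha with |alpha| <= degree (total-degree truncation)."""
    return [a for a in itertools.product(range(degree + 1), repeat=dim)
            if sum(a) <= degree]


def basis_matrix(xi, alphas):
    """Evaluate an orthonormal Legendre basis at xi of shape (n, dim)."""
    dim = xi.shape[1]
    max_deg = max(max(a) for a in alphas)
    # sqrt(2k + 1) * P_k is orthonormal when each input is uniform on [-1, 1].
    polys = [np.sqrt(2 * k + 1) * legendre.legval(xi, [0] * k + [1])
             for k in range(max_deg + 1)]
    cols = [np.prod([polys[a[j]][:, j] for j in range(dim)], axis=0)
            for a in alphas]
    return np.column_stack(cols)


def fit_pce(xi, y, degree):
    """Least-squares PCE coefficients from ensemble inputs xi and outputs y."""
    alphas = multi_indices(xi.shape[1], degree)
    coeffs, *_ = np.linalg.lstsq(basis_matrix(xi, alphas), y, rcond=None)
    return coeffs, alphas


def sobol_from_pce(coeffs, alphas):
    """First-order and total Sobol indices from orthonormal PCE coefficients."""
    alphas = np.asarray(alphas)
    variance = np.sum(coeffs[alphas.sum(axis=1) > 0] ** 2)
    dim = alphas.shape[1]
    first, total = np.zeros(dim), np.zeros(dim)
    for i in range(dim):
        involves_i = alphas[:, i] > 0
        others_zero = np.delete(alphas, i, axis=1).sum(axis=1) == 0
        first[i] = np.sum(coeffs[involves_i & others_zero] ** 2) / variance
        total[i] = np.sum(coeffs[involves_i] ** 2) / variance
    return first, total


def toy_policy_output(xi):
    """Hypothetical stand-in for one scalar output of a trained model."""
    return 2.0 * xi[:, 0] + xi[:, 1] + 0.5 * xi[:, 0] * xi[:, 1]


rng = np.random.default_rng(0)
xi_train = rng.uniform(-1.0, 1.0, size=(20, 2))      # 20 "ensemble members"
coeffs, alphas = fit_pce(xi_train, toy_policy_output(xi_train), degree=2)
first, total = sobol_from_pce(coeffs, alphas)

# Evaluating the surrogate at 10,000 new reward draws is one matrix product.
xi_new = rng.uniform(-1.0, 1.0, size=(10_000, 2))
y_new = basis_matrix(xi_new, alphas) @ coeffs
```

In this toy case the target is itself a degree-2 polynomial, so the fit is exact and the indices match their analytic values. A real policy output would leave a truncation residual, which is the error the referee report asks the authors to quantify.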

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same surrogate construction could be applied to other sequential generators to obtain interpretable sensitivity maps without retraining costs.
Reward design in discovery pipelines could be refined by first identifying and stabilizing the high-sensitivity components flagged by these indices.
The approach opens a route to adaptive experiment selection that prioritizes measurements reducing uncertainty in the most fragile generative steps.

Load-bearing premise

That polynomial chaos expansions fitted to small ensembles of trained GFlowNets propagate and decompose the epistemic uncertainty without large approximation error in the resulting sensitivity indices.

What would settle it

A direct comparison showing that Sobol indices obtained from the polynomial chaos surrogate differ substantially from indices recomputed by exhaustive retraining of many independent GFlowNets on the same reward ensembles would falsify the accuracy of the decomposition.
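The reference side of such a comparison could be a sampling-based estimate that never touches the surrogate. Below is a minimal sketch using a pick-freeze (Saltelli-style) estimator of first-order Sobol indices. Everything in it is an assumption for illustration: `toy_model` stands in for a retrained generative model, inputs are taken as independent and uniform on $[-1, 1]$, and `surrogate_first` is filled with the toy function's exact indices rather than values from a fitted expansion.

```python
import numpy as np


def pick_freeze_first_order(f, dim, n, rng):
    """Monte Carlo first-order Sobol indices under independent U(-1, 1) inputs."""
    A = rng.uniform(-1.0, 1.0, size=(n, dim))
    B = rng.uniform(-1.0, 1.0, size=(n, dim))
    fA, fB = f(A), f(B)
    variance = np.var(np.concatenate([fA, fB]))
    indices = np.empty(dim)
    for i in range(dim):
        AB = A.copy()
        AB[:, i] = B[:, i]            # share only input i with sample B
        indices[i] = np.mean(fB * (f(AB) - fA)) / variance
    return indices


def toy_model(xi):
    """Hypothetical stand-in; in practice each call would be a retrained model."""
    return 2.0 * xi[:, 0] + xi[:, 1] + 0.5 * xi[:, 0] * xi[:, 1]


rng = np.random.default_rng(0)
reference_first = pick_freeze_first_order(toy_model, dim=2, n=100_000, rng=rng)

# Stand-in for surrogate-derived indices: exact values for the toy function.
surrogate_first = np.array([48 / 61, 12 / 61])
max_discrepancy = float(np.max(np.abs(reference_first - surrogate_first)))
```

The estimator needs `n * (dim + 2)` model evaluations, which is what makes the exhaustive route expensive when each evaluation means retraining. A discrepancy well below the gaps between reported sensitivities would support the decomposition; a comparable one would undermine it.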

Original abstract

Sequential generative models conditioned on uncertain rewards are central to AI-driven scientific discovery, yet the epistemic uncertainty they inherit from imperfect reward estimates remains unquantified. We propagate this uncertainty through generative flow networks (GFlowNets) by fitting polynomial chaos expansions (PCEs) to small ensembles of trained models. The PCE coefficients yield analytical Sobol sensitivity indices, providing the first interpretable decomposition of which reward components drive which generative decisions, a capability unavailable from deep ensembles, Bayesian neural networks, or Monte Carlo dropout. Convergence guarantees are established theoretically and four of five are formally verified in the Lean 4 proof assistant. Across three real-world tasks the framework reveals actionable structure invisible to ensembles alone. On the Doyle-Dreher Buchwald-Hartwig dataset catalyst selection is robust ($D_{\mathrm{catalyst}}\approx 71$) while additive selection is fragile ($D_{\mathrm{additive}}\approx 179$, $2.5\times$ higher). In fragment-based molecular design the linker position is the most sensitive ($D_{\mathrm{linker}}\approx 28$) while decoration positions are the most robust ($D\approx 14$-$18$), reversing the conventional scaffold-robust / decoration-fragile assumption. On the Sachs protein signalling network, MAPK-cascade edges and PKA/PKC hub edges separate into distinct sensitivity regimes, providing a targeted map for perturbation experiments. Calibration coverage at the 95% level reaches 0.97-1.00 across the dominant steps, and the surrogate evaluates 10,000 policy samples in milliseconds, $10^{3}$-$10^{4}\times$ faster than exhaustive retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PCE on small GFlowNet ensembles yields analytical Sobol indices for epistemic uncertainty, with partial Lean verification and clear task-level structure, but the surrogate error bound is the part that still needs tighter checking.

The core advance is fitting polynomial chaos expansions to ensembles of trained GFlowNets so that the resulting coefficients give closed-form Sobol indices, which decompose which reward components most affect the generated trajectories. That combination, plus four of five convergence statements checked in Lean 4, is not something I have seen in the GFlowNet or uncertainty-quantification literature before.

On the three tasks the numbers are concrete: catalyst selection stays robust while additive choice is about 2.5 times as sensitive in the Buchwald-Hartwig data; linker position dominates in fragment design while decorations are steadier; and the Sachs network cleanly separates MAPK-cascade edges from PKA/PKC hubs. Calibration at the 95% level lands between 0.97 and 1.00, and the surrogate is three to four orders of magnitude faster than retraining. Those are the parts that actually move the needle for interpretability in scientific generative work.

The remaining question is how much truncation or fitting error in the PCE itself leaks into the sensitivity rankings. The paper uses small ensembles and reports no explicit bound on the downstream index distortion when the underlying policy is moderately nonlinear. If that error is comparable to the reported 2.5-fold differences, the claimed advantage over plain ensembles shrinks. The abstract and stress-test note both flag this as the load-bearing assumption, and the full text does not appear to close it with a separate error-propagation argument or larger-ensemble ablation.

Readers who already use GFlowNets for molecule or reaction design will find the sensitivity maps useful even if the error analysis is only partial. The formal verification and the reversal of the usual scaffold-robust, decoration-fragile intuition are enough to justify sending the manuscript to referees rather than issuing a desk reject; a reviewer can ask for the missing error quantification without needing to rewrite the whole story.

Referee Report

2 major / 2 minor

Summary. The paper proposes propagating epistemic uncertainty from imperfect reward estimates through GFlowNets by fitting polynomial chaos expansions (PCEs) to small ensembles of trained models. The resulting PCE coefficients enable analytical Sobol sensitivity indices that decompose which reward components drive generative decisions. Theoretical convergence guarantees are derived, with four of five formally verified in Lean 4. Experiments on the Doyle-Dreher Buchwald-Hartwig dataset, fragment-based molecular design, and the Sachs protein signalling network report high calibration coverage (0.97-1.00) and actionable sensitivity rankings, such as robust catalyst selection (D_catalyst ≈ 71) versus fragile additive selection (D_additive ≈ 179). The surrogate is claimed to be 10^3-10^4× faster than retraining.

Significance. If the PCE surrogate approximation error remains negligible relative to the reported sensitivity differences, the work provides a valuable new capability for interpretable epistemic uncertainty decomposition in sequential generative models, unavailable from standard ensemble or dropout methods. The formal verification of convergence guarantees and the empirical demonstration of reversed conventional assumptions (e.g., linker vs. decoration sensitivity) are notable strengths. The computational efficiency of the surrogate further supports practical utility in scientific discovery tasks.

major comments (2)

[Abstract and uncertainty propagation section] The central claim that PCE coefficients from small GFlowNet ensembles yield faithful analytical Sobol indices rests on the premise that truncation and estimation error do not distort sensitivity rankings. No quantitative bound or ablation is visible showing that approximation error is substantially smaller than the reported effect sizes (e.g., the 2.5× gap between D_catalyst and D_additive). This is load-bearing for the interpretability advantage over deep ensembles.
[Methods and theoretical guarantees] The mapping from GFlowNet trajectory distributions (high-dimensional discrete spaces) to PCE inputs assumes moderate nonlinearity; the manuscript should explicitly test or bound the impact of higher-order interactions on the downstream Sobol indices when ensemble size is small.

minor comments (2)

[Experimental setup] Clarify the exact ensemble size used for PCE fitting and the truncation order selection procedure, as these are listed as free parameters.
[Results] Add a direct comparison table of sensitivity rankings obtained from the PCE surrogate versus a larger ensemble or Monte Carlo reference to quantify any ranking discrepancies.
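One way to populate the comparison table requested in the last comment is a rank-agreement score between the two sets of indices. The sketch below computes Kendall's tau (the tau-a variant, which does not correct for ties) in plain Python. The function name and the index values are illustrative placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder indices for three reward components (made up for illustration).
surrogate_indices = [0.5, 0.3, 0.2]
reference_same_order = [0.45, 0.35, 0.2]    # same ranking as the surrogate
reference_top_swapped = [0.3, 0.5, 0.2]     # top two components swapped

tau_same = kendall_tau(surrogate_indices, reference_same_order)
tau_swapped = kendall_tau(surrogate_indices, reference_top_swapped)
```

A tau of 1 means the surrogate and the reference rank every pair of components the same way, while lower values count the share of pairs whose order flips. Reporting it per task alongside the raw indices would make any ranking discrepancy visible at a glance.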

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and insightful comments. These have helped us identify areas where additional evidence can strengthen the manuscript's claims regarding the reliability of the PCE-derived Sobol indices. We respond to each major comment below and indicate the changes we will implement.

Referee: [Abstract and uncertainty propagation section] The central claim that PCE coefficients from small GFlowNet ensembles yield faithful analytical Sobol indices rests on the premise that truncation and estimation error do not distort sensitivity rankings. No quantitative bound or ablation is visible showing that approximation error is substantially smaller than the reported effect sizes (e.g., the 2.5× gap between D_catalyst and D_additive). This is load-bearing for the interpretability advantage over deep ensembles.

Authors: We agree that demonstrating the approximation error is substantially smaller than the reported sensitivity differences is crucial to support the interpretability claims. While the reported calibration coverage of 0.97-1.00 offers indirect evidence of fidelity, we acknowledge that an explicit quantitative ablation is absent. In the revised manuscript we will add a dedicated ablation study in the uncertainty propagation section that quantifies PCE truncation and estimation errors across the ensemble sizes used in the experiments. This study will directly compare the error magnitudes to the observed effect sizes (including the 2.5× gap between D_catalyst and D_additive) and will show that the errors remain at least an order of magnitude smaller, thereby reinforcing the advantage over standard ensembles.

revision: yes

Referee: [Methods and theoretical guarantees] The mapping from GFlowNet trajectory distributions (high-dimensional discrete spaces) to PCE inputs assumes moderate nonlinearity; the manuscript should explicitly test or bound the impact of higher-order interactions on the downstream Sobol indices when ensemble size is small.

Authors: The referee correctly notes that the theoretical guarantees rely on moderate nonlinearity in the mapping from high-dimensional discrete trajectory distributions to PCE inputs. Although the convergence results are stated under conditions that bound higher-order contributions, we have not provided explicit empirical tests of their impact for small ensembles. In the revised methods section we will include a controlled synthetic experiment that systematically varies the degree of nonlinearity and ensemble size, then measures the resulting deviation in the computed Sobol indices. This will supply a practical bound on the influence of higher-order interactions and will clarify the operating regime for discrete generative tasks.

revision: yes

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Ledger populated from abstract only; full manuscript may list additional fitted quantities such as PCE degree or ensemble cardinality.

free parameters (1)

PCE truncation order and ensemble size
Required to fit the surrogate but not numerically specified in the abstract.

axioms (1)

standard math · Convergence of the PCE approximation to the true uncertainty propagation map
Invoked to justify analytical Sobol indices; four of five guarantees formally verified in Lean 4.
