
arxiv: 2605.18472 · v1 · pith:JXRH5IKCnew · submitted 2026-05-18 · 📊 stat.ML · cs.AI · cs.LG

Flowing with Confidence

Pith reviewed 2026-05-20 08:19 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords flow matching · generative models · uncertainty estimation · confidence scores · ODE integration · variance propagation · continuous normalizing flows

The pith

Flow Matching with Confidence yields per-sample scores by injecting input-dependent multiplicative noise and propagating its variance in closed form along the ODE.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative models produce outputs of uneven reliability, yet existing ways to measure that reliability multiply compute by running ensembles or repeated trajectories. The paper introduces Flow Matching with Confidence, which adds multiplicative noise at chosen layers, propagates the resulting variance through the network in closed form, and integrates it along the sampling ODE. This produces a per-sample confidence score at ordinary sampling cost. If correct, the score lets users filter unreliable outputs for higher quality, rewind and redirect uncertain trajectories, and focus extra steps only where the flow is ambiguous. A sympathetic reader would care because it turns opaque generation into something whose trustworthiness can be assessed without slowing the process.

Core claim

Flow Matching with Confidence (FMwC) injects input-dependent multiplicative noise at selected layers, propagates its variance through the network in closed form, and integrates it along the ODE trajectory, yielding a per-sample confidence score at standard sampling cost.

What carries the argument

Closed-form propagation of variance from injected multiplicative noise, integrated along the continuous flow ODE.
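The mechanism can be illustrated on a toy model. The following is a minimal sketch, not the authors' implementation: it assumes a one-layer linear velocity field, multiplicative Gaussian noise of the form `x * (1 + sqrt(alpha) * eps)`, Euler integration, and a score defined as the time integral of the summed velocity variance. The names `noisy_linear_moments`, `sample_with_variance`, and `alpha_fn` are hypothetical.

```python
import numpy as np

def noisy_linear_moments(W, b, x, alpha):
    """Mean and per-output variance of y = W (x * (1 + sqrt(alpha) * eps)) + b,
    where eps ~ N(0, I) is independent per input coordinate and x is deterministic.
    For a linear layer these moments are exact, so no sampling is needed."""
    mean = W @ x + b
    var = (W ** 2) @ (alpha * x ** 2)
    return mean, var

def sample_with_variance(x0, W, b, alpha_fn, n_steps=50):
    """Euler-integrate dx/dt = E[v(x)] for the toy velocity field and accumulate
    the injected-noise variance along the trajectory (assumed scoring rule)."""
    dt = 1.0 / n_steps
    x = x0.copy()
    score = 0.0
    for _ in range(n_steps):
        v_mean, v_var = noisy_linear_moments(W, b, x, alpha_fn(x))
        x = x + dt * v_mean          # ordinary sampling step on the mean velocity
        score += dt * v_var.sum()    # trajectory-integrated variance
    return x, score
```

Under this convention a lower accumulated variance reads as higher confidence, and the extra cost per step is one additional matrix-vector product rather than repeated trajectories.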

If this is right

  • Filtering low-confidence samples raises image quality and improves thermodynamic stability of generated crystals.
  • Trajectories can be edited by rewinding to high-uncertainty points and redirecting the flow.
  • Adaptive ODE stepping concentrates computation where the velocity field is most ambiguous.
  • The confidence score correlates with the magnitude of the divergence of the learned velocity field.
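Two of these uses can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's procedure: the score is taken to be a variance (lower means more confident), and the step allocation is a simple offline equidistribution heuristic swapped in for the paper's online controller, whose details are not given here. `filter_by_confidence` and `adaptive_time_grid` are hypothetical names.

```python
import numpy as np

def filter_by_confidence(samples, variance_scores, keep_fraction=0.5):
    """Keep the keep_fraction of samples with the lowest variance score."""
    scores = np.asarray(variance_scores, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    keep_idx = np.argsort(scores, kind="stable")[:n_keep]
    return [samples[i] for i in keep_idx], keep_idx

def adaptive_time_grid(times, variance_profile, n_steps):
    """Place n_steps + 1 time points so that each interval carries the same
    integrated variance, giving smaller steps where the profile is large."""
    t = np.asarray(times, dtype=float)
    w = np.asarray(variance_profile, dtype=float) + 1e-12  # keep the integral strictly increasing
    # Cumulative trapezoidal integral of the variance profile over time.
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (w[1:] + w[:-1]) * np.diff(t))])
    targets = np.linspace(0.0, cum[-1], n_steps + 1)
    return np.interp(targets, cum, t)
```

A variance profile measured on a pilot trajectory could be passed to `adaptive_time_grid` to build the grid for later samples; an online controller would instead adjust the step size during integration.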

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The variance-propagation technique could be adapted to other continuous normalizing flow or diffusion models that evolve via ODEs.
  • The observed correlation with velocity divergence might motivate new regularization terms during training to produce more stable flows.
  • Surgical guidance methods could be developed that intervene only at the moments the score flags as uncertain.

Load-bearing premise

The method assumes that the tracked variance of the injected noise accurately reflects the model's true uncertainty about its outputs rather than merely recording an auxiliary signal.

What would settle it

Measure whether the computed scores predict actual per-sample error rates on data with ground truth, or test whether filtering generations by the score measurably raises average quality metrics; failure on either test would falsify the central claim.
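Both tests are straightforward to run once per-sample scores and a ground-truth error or quality measure are available. The sketch below assumes such arrays exist and that a lower score means higher confidence; `spearman` and `filtering_gain` are hypothetical helper names, and the rank function breaks ties by position, which is adequate only for continuous scores.

```python
import numpy as np

def rank(a):
    """Ordinal ranks 0..n-1; ties are broken by position."""
    order = np.argsort(a, kind="stable")
    r = np.empty(len(a), dtype=float)
    r[order] = np.arange(len(a))
    return r

def spearman(scores, errors):
    """Spearman rank correlation between variance scores and per-sample errors."""
    rs = rank(np.asarray(scores, dtype=float))
    re = rank(np.asarray(errors, dtype=float))
    return float(np.corrcoef(rs, re)[0, 1])

def filtering_gain(scores, quality, keep_fraction=0.5):
    """Mean quality of the kept (lowest-variance) samples minus mean quality of all samples."""
    scores = np.asarray(scores, dtype=float)
    quality = np.asarray(quality, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    kept = np.argsort(scores, kind="stable")[:n_keep]
    return float(quality[kept].mean() - quality.mean())
```

A rank correlation near zero, or a filtering gain indistinguishable from zero across seeds, would count against the central claim.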

Figures

Figures reproduced from arXiv: 2605.18472 by Dario Coscia, Erik Bekkers, Friso de Kruiff, Max Welling.

Figure 1: Flow-Matching model with Confidence. (Left) Variance is propagated alongside the sample as it is generated along the path $p_0 \to p_1$: well-placed samples (green) contract into a tight $\pm\sigma$ band, while misplaced ones (red) stay diffuse. (Right) Sorting by the resulting confidence score recovers quality across modalities: low-energy crystals and clean digits at high confidence, implausible structures and malformed d…
Figure 2: $\sigma_t^2$ recovers the divergence structure of the learned velocity field at bifurcation time. Spatial fields at $t = 0.6$ on the Checkerboard, where $|\nabla \cdot v_{t,\theta}|$ (right) is available analytically. Quantitative correlation across modalities in Sec. 4.
3.4 Where and When Does Uncertainty Arise? The construction above is mechanically motivated, but the reason the resulting $\sigma_t^2$ is informative is geometric. The optimal …
Figure 3: FMwC’s per-channel $t^\star$ targets a specific bifurcation; FM at random $t$ misses it. (a) Seed DyErNi$_4$ (on the hull) under chem-swap (atoms-only at $t^\star_{\mathrm{atom}}$) and polymorph (coords-only at $t^\star_{\mathrm{coord}}$): FMwC’s edits stay near the hull, FM’s drift off. (b–d) Per-mode success rate over 50 seeds × 20 replicates; FMwC ($t = t^\star$, solid) vs FM (uniform random $t$, hatched), at matched channel mask and $\sigma$. Success requires …
Figure 4: A single per-sample adaptive-stepping signal, $\sigma_t^2(t)$, generalises across modalities. Each panel plots the application’s headline quality metric against integrator step budget $N$, oriented so down-and-right is better (Crystal y-axis inverted). Methods: FM Uniform (grey), FMwC Uniform (blue), and FMwC Online Adaptive (green), with the controller chosen per modality. Adaptive stepping reallocates compute to…
Figure 5: On the Crystals model, one forward pass of FMwC matches Hutchinson’s divergence. Spearman and Pearson correlation against the high-$K$ Hutchinson reference vs. measured FLOPs per query, on the FlowMM backbone; sweep over $K \in \{1, \dots, 500\}$.
Second, the correspondence is universal and geometry-graded
Original abstract

Generative models can produce nonsensical text, unrealistic images, and unstable materials faster than simulation or human review can absorb; without per-sample confidence, trust erodes. Existing fixes run $k$ ensembles or stochastic trajectories at $k\times$ compute, measuring variability between models, not model confidence. We propose Flow Matching with Confidence (FMwC). FMwC injects input-dependent multiplicative noise at selected layers, propagates its variance through the network in closed form, and integrates it along the ODE trajectory, yielding a per-sample confidence score at standard sampling cost. The score supports multiple uses: filtering improves image quality and thermodynamic stability of crystals; editing rewinds trajectories to the points where the model commits and redirects them; and adaptive stepping concentrates ODE compute where the flow is ambiguous. We find that the confidence score correlates with the magnitude of the divergence of the learned velocity field, which gives us a window to understand the generative process, opening up surgical forms of guidance that target the moments that matter, new sampling algorithms and interpretability of generative models.
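The divergence that the score is compared against is commonly estimated with Hutchinson's trace estimator, the reference named in the Figure 5 caption. A minimal sketch follows, assuming a velocity field `v` that is a plain Python callable on NumPy vectors and approximating Jacobian-vector products by central finite differences; an autodiff framework would normally supply exact products. The function name and its parameters are illustrative.

```python
import numpy as np

def hutchinson_divergence(v, x, n_probes=16, fd_eps=1e-4, rng=None):
    """Estimate div v(x) = tr(J_v(x)) as the average of eps^T J eps over
    Rademacher probe vectors eps; J eps is approximated by central differences."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_probes):
        eps = rng.choice([-1.0, 1.0], size=x.shape)
        jvp = (v(x + fd_eps * eps) - v(x - fd_eps * eps)) / (2.0 * fd_eps)
        total += float(eps @ jvp)
    return total / n_probes
```

Each probe costs two extra evaluations of `v` in this finite-difference form, which is why a high-probe reference is expensive compared with a score obtained in a single forward pass.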

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Flow Matching with Confidence (FMwC) for generative flow models. It injects input-dependent multiplicative noise at selected layers, derives a closed-form propagation of the resulting variance through the network, and integrates this quantity along the ODE trajectory to obtain a per-sample confidence score at standard sampling cost. The score is applied to sample filtering (improving image quality and crystal stability), trajectory editing (rewinding to commitment points), and adaptive ODE stepping. The authors additionally report a correlation between the derived confidence and the magnitude of the divergence of the learned velocity field.

Significance. If the propagated variance can be shown to track genuine model uncertainty rather than an auxiliary quantity, the method would supply an efficient, single-trajectory uncertainty measure for flow-matching models. This could enable practical improvements in generation reliability and open new directions for interpretability and adaptive sampling without ensemble overhead.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the central claim that closed-form variance propagation of the injected multiplicative noise yields a meaningful per-sample confidence score rests on the untested modeling assumption that this auxiliary quantity corresponds to true epistemic uncertainty or predictive error. The reported correlation with velocity-field divergence magnitude is presented as validation, yet no direct comparisons to ground-truth measures (ensemble variance, reconstruction error on held-out trajectories, or calibration metrics) are described, leaving the mapping from propagated variance to 'confidence' as an open assumption rather than an established result.
  2. [§4] §4 (experimental validation): the free parameters (choice of layers for noise injection and the noise scale) are listed in the axiom ledger but their selection procedure and sensitivity are not quantified. If these choices are post-hoc tuned on the same data used to demonstrate correlation with divergence, the reported utility for filtering and editing may not generalize.
minor comments (2)
  1. [§3] Notation for the variance propagation step should be introduced with an explicit equation (e.g., Eq. (X)) rather than described only in prose, to allow readers to verify the closed-form claim for standard residual or attention blocks.
  2. [Figures 4-6] Figure captions for the editing and adaptive-stepping examples should include quantitative metrics (e.g., FID improvement or wall-clock savings) alongside qualitative visuals.
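
As an illustration of the kind of explicit equation minor comment 1 asks for, the following sketch gives the moments of a single linear layer under multiplicative Gaussian noise. The notation ($W$, $b$, $\alpha_i(x)$, $\epsilon_i$, $\phi$) is assumed for this example and is not taken from the paper; the input $x$ is treated as deterministic.

```latex
\begin{aligned}
\tilde{x}_i &= x_i \bigl(1 + \sqrt{\alpha_i(x)}\,\epsilon_i\bigr),
  \qquad \epsilon_i \sim \mathcal{N}(0, 1) \ \text{independent}, \\
y &= W \tilde{x} + b, \\
\mathbb{E}[y_j] &= \sum_i W_{ji}\, x_i + b_j,
  \qquad
  \operatorname{Var}[y_j] = \sum_i W_{ji}^2\, \alpha_i(x)\, x_i^2, \\
\operatorname{Var}\bigl[\phi(y_j)\bigr] &\approx \phi'\bigl(\mathbb{E}[y_j]\bigr)^2 \operatorname{Var}[y_j]
  \qquad \text{(first-order approximation, elementwise } \phi\text{)}.
\end{aligned}
```

The mean and variance of $y$ are exact for the linear map. The last line is only a first-order (delta-method) approximation, so how the closed-form claim holds through nonlinear, residual, or attention blocks is the part a reader would need the paper's own equation to check.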

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and experimental details that we will address in revision. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): the central claim that closed-form variance propagation of the injected multiplicative noise yields a meaningful per-sample confidence score rests on the untested modeling assumption that this auxiliary quantity corresponds to true epistemic uncertainty or predictive error. The reported correlation with velocity-field divergence magnitude is presented as validation, yet no direct comparisons to ground-truth measures (ensemble variance, reconstruction error on held-out trajectories, or calibration metrics) are described, leaving the mapping from propagated variance to 'confidence' as an open assumption rather than an established result.

    Authors: We agree that the correspondence between the propagated variance and true epistemic uncertainty is an assumption that would benefit from stronger empirical support. The closed-form propagation is derived directly from the network architecture and the input-dependent noise injection, ensuring it consistently tracks the effect of input perturbations through the ODE. The correlation with velocity-field divergence is offered as supporting evidence, since divergence magnitude in flow models often reflects regions of higher generative sensitivity. We did not perform direct comparisons to ensemble variance or calibration metrics in the submitted version in order to emphasize the single-trajectory efficiency. We will revise the manuscript to explicitly discuss this modeling assumption as a limitation and to add preliminary comparisons against ensemble-based uncertainty estimates. revision: yes

  2. Referee: [§4] §4 (experimental validation): the free parameters (choice of layers for noise injection and the noise scale) are listed in the axiom ledger but their selection procedure and sensitivity are not quantified. If these choices are post-hoc tuned on the same data used to demonstrate correlation with divergence, the reported utility for filtering and editing may not generalize.

    Authors: The referee correctly identifies that the selection procedure and sensitivity of the noise-injection layers and scale require clearer documentation. These hyperparameters were chosen to target layers operating at multiple feature scales while keeping the injected variance small enough to act as a perturbation rather than dominate the signal; details appear in the supplementary material. To address potential concerns about post-hoc tuning and generalization, we will revise §4 to include an explicit description of the selection criteria and a sensitivity analysis demonstrating that filtering and editing performance remains stable across reasonable variations of these parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the FMwC confidence score derivation

Full rationale

The paper constructs the per-sample confidence score directly from the closed-form propagation of input-dependent multiplicative noise variance through the network layers and its integration along the ODE trajectory. This is presented as the definition of the score rather than a prediction of an independent quantity. The correlation with the magnitude of the divergence of the learned velocity field is reported as a post-hoc empirical observation, not as a definitional identity or fitted result. No load-bearing self-citations or uniqueness theorems from prior work by the authors are invoked to justify the core method. The derivation chain is self-contained and does not reduce to tautology by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions of flow matching (continuous ODE trajectories, differentiable velocity fields) and neural-network forward passes; no new physical entities are postulated. Layer selection and noise scale are implicit free parameters whose values are not reported in the abstract.

free parameters (2)
  • selected layers for noise injection
    Choice of which layers receive the multiplicative noise; affects both computational overhead and the resulting confidence signal.
  • noise scale parameter
    Magnitude of the injected input-dependent noise; must be chosen to produce a useful confidence range.
axioms (2)
  • standard math: The network is differentiable and the velocity field admits a well-defined ODE trajectory.
    Invoked when integrating variance along the sampling path.
  • domain assumption: Variance propagation remains accurate under the chosen network architecture and activation functions.
    Required for the closed-form claim to hold without simulation.



