Auto-Encoding Variational Bayes
Mixed citation behavior. Most common role is background (65%).
abstract
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
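The reparameterization described in the abstract can be illustrated with a minimal single-sample sketch of the lower bound estimator. This is not the authors' code. It assumes a diagonal Gaussian approximate posterior, a standard normal prior, and a unit-variance Gaussian likelihood. The names `reparameterize`, `gaussian_kl`, `elbo_estimate`, and the caller-supplied `decode` function are hypothetical and chosen only for illustration.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps, so z is a deterministic, differentiable
    function of mu and log_var (the point of the reparameterization).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def gaussian_kl(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def elbo_estimate(x, mu, log_var, decode, rng):
    """Single-sample estimate of the variational lower bound for one datapoint.

    Assumption: p(x | z) is Gaussian with mean decode(z) and unit variance.
    """
    z = reparameterize(mu, log_var, rng)
    x_hat = decode(z)
    log_lik = (-0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
               - 0.5 * len(x) * math.log(2.0 * math.pi))
    return log_lik - gaussian_kl(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical identity decoder, used only to exercise the estimator.
    value = elbo_estimate([0.5, -0.2], [0.0, 0.0], [0.0, 0.0], lambda z: z, rng)
    print("single-sample lower bound estimate:", value)
```

In an actual model, `mu` and `log_var` would be outputs of a recognition network, and the estimate would be averaged over minibatches and maximized with a stochastic gradient method through an automatic differentiation framework.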
representative citing papers
RICA replaces ICA's global generative model with local Riemannian geometry, introducing a disentanglement tensor based on the Hessian of the log-likelihood and Ricci curvature to measure pointwise disentanglement, which recovers sources across manifolds in controlled tests.
MIRAGE discovers semantic attacks on online HD map construction via conditional diffusion, enabling boundary removal and injection that degrade AV performance while passing as realistic environmental changes.
Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity and privacy.
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
DDIMs construct non-Markovian diffusion processes that share DDPM training objectives but allow much faster reverse sampling, demonstrated empirically at 10-50x wall-clock speedup.
Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
PathVQA is the first public dataset of over 32,000 questions on nearly 5,000 pathology images for medical visual question answering.
Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.
JointHOI jointly generates hand-object motion and distance-based contact maps in one diffusion stage to improve temporal stability and physical plausibility over prior multi-stage HOI methods.
A VAE-plus-Bayesian-optimization framework discovers new symbolic iterative optimization algorithms without assuming update function forms and faster than prior mathematical programming methods.
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
Theoretical characterization of the inlier-memorization effect in simple autoencoders, deriving its emergence, strength, and persistence from data distribution and initialization, plus guidelines achieving SOTA on ADBench.
MammoFlow adds geometric alignment and EMD tissue-distribution consistency to a pretrained flow-matching model to generate anatomically paired mammograms, reporting superior quality and a 5% downstream AUC gain.
COMAD discovers and reuses coordination skills from mixed offline MARL data via auto-encoders and density-based estimation to achieve continual learning with better transfer.
DEN is an unsupervised neural framework that uses dimension expansion to enable efficient inverse design of nanophotonic structures from low-dimensional objectives via differentiable simulations.
Action-BED recasts BED as expected future loss on actions, producing singly intractable objectives jointly optimized for design and action policies via stochastic gradients without explicit posterior estimation.
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
citing papers explorer
-
WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
-
Disentanglement Beyond Generative Models with Riemannian ICA
RICA replaces ICA's global generative model with local Riemannian geometry, introducing a disentanglement tensor based on the Hessian of the log-likelihood and Ricci curvature to measure pointwise disentanglement, which recovers sources across manifolds in controlled tests.
-
Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion
MIRAGE discovers semantic attacks on online HD map construction via conditional diffusion, enabling boundary removal and injection that degrade AV performance while passing as realistic environmental changes.
-
Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity and privacy.
-
Gradient-Based Program Synthesis with Neurally Interpreted Languages
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
-
GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Denoising Diffusion Implicit Models
DDIMs construct non-Markovian diffusion processes that share DDPM training objectives but allow much faster reverse sampling, demonstrated empirically at 10-50x wall-clock speedup.
-
Categorical Reparameterization with Gumbel-Softmax
Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.
-
Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.
-
JointHOI: Jointly Generating Contact Maps Enhances Hand Object Interaction Generation
JointHOI jointly generates hand-object motion and distance-based contact maps in one diffusion stage to improve temporal stability and physical plausibility over prior multi-stage HOI methods.
-
Symbolic Discovery of Iterative Algorithms: A Continuous Latent Space Bayesian Optimization Framework
A VAE-plus-Bayesian-optimization framework discovers new symbolic iterative optimization algorithms without assuming update function forms and faster than prior mathematical programming methods.
-
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
-
What Drives the Inlier-Memorization Effect? A Theory of Outlier Detection via Early Training Dynamics
Theoretical characterization of the inlier-memorization effect in simple autoencoders, deriving its emergence, strength, and persistence from data distribution and initialization, plus guidelines achieving SOTA on ADBench.
-
MammoFlow: Multiview Mammogram Synthesis with Anatomically Consistent Flow Matching
MammoFlow adds geometric alignment and EMD tissue-distribution consistency to a pretrained flow-matching model to generate anatomically paired mammograms, reporting superior quality and a 5% downstream AUC gain.
-
Offline Multi-agent Continual Cooperation via Skill Partition and Reuse
COMAD discovers and reuses coordination skills from mixed offline MARL data via auto-encoders and density-based estimation to achieve continual learning with better transfer.
-
Dimension expansion for simulation-efficient nanophotonic neural networks
DEN is an unsupervised neural framework that uses dimension expansion to enable efficient inverse design of nanophotonic structures from low-dimensional objectives via differentiable simulations.
-
Action-BED: Task-Driven Bayesian Experimental Design with Singly Intractable Objectives
Action-BED recasts BED as expected future loss on actions, producing singly intractable objectives jointly optimized for design and action policies via stochastic gradients without explicit posterior estimation.
-
AudioCALM: Continuous Autoregressive Language Modeling for Universal Audio Generation
AudioCALM presents a continuous autoregressive framework with flow-matching prediction and A-MoME architecture that unifies speech, sound, and music generation while matching modality-specific state-of-the-art performance.
-
Reaction-Network-Level Discovery of Ammonia Synthesis Catalysts via Ten-Million-Scale Generative Exploration
Ten-million-scale generative Transformers with ML potentials map compatibility across N*, NH*, NNH*, and HNNH* to discover 279 ammonia synthesis catalyst candidates, recovering Fe/Ru motifs and identifying new families like Fe-V and Al-Pd-Zr validated by DFT.
-
Causal Gaussian Processes for Robust Treatment Effect Evaluation with Unobserved Confounding
Develops Causal Gaussian Process models that approximate any causal model's observational and interventional distributions via universal discretization of exogenous domains for robust treatment effect evaluation under unobserved confounding.
-
Unsupervised Disentanglement Without Compromises: How Functional Orthogonality Enforces Identifiability
Enforcing local orthogonality on the Jacobian of the generative mapping yields identifiability for general nonlinear models when the latent domain has full combinatorial support.
-
$\Omega$: Operator-based Mixture Ensemble for Generative Assimilation
Ω is a generative assimilation method that learns residual discrepancies from ensemble data using a conditional Gaussian baseline, then reconstructs full non-Gaussian posteriors via Gaussian mixtures and annealed Langevin sampling.
-
Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting
Introduces Robust-TOOC benchmark for corrupted images and Dual-TTT test-time training that updates only a text-guided denoising module to boost robustness in open-vocabulary counting.
-
OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation
DRIVE-CHOREO uses three LLM agents to create a unified position-aware token sequence co-compressed with multi-view video, achieving SOTA BEV mAP of 21.6 and +2.4 NDS improvement on nuScenes.
-
Low-variance estimators overcome the phase-gradient bottleneck in complex-valued neural quantum states
Direct differentiation of the local energy at fixed samples yields an unbiased low-variance estimator for the variational Monte Carlo phase force in complex neural quantum states, with an adaptive mixture extending it to coupled networks and improving results on flux ladders, chiral chains, and frac
-
InterleaveThinker: Reinforcing Agentic Interleaved Generation
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
-
CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation
CineOrchestra unifies control of subjects, events, cameras, and shot transitions in cinematic video generation through entity-centric conditioning primitives and parameter-free coordinated rotary embeddings.
-
Implicit Neural Representations of Individual Behavior
Behavioral INR adapts INRs to behavior by mapping states to actions with FiLM-modulated episode latents for self-supervised policy inference in unlabeled data, with new policy OOD definitions.
-
TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation
A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.
-
Expected Free Energy-based Planning as Variational Inference
EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.
-
Lattice genome: representation and analysis of heterogeneous crystalline microstructures
The paper proposes lattice genes as VAE-encoded representations of Kikuchi diffraction patterns and lattice genomes as their spatial maps for analyzing heterogeneity in crystalline microstructures of Ni-base superalloys.
-
PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees
PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.
-
A Hybrid Generative Reduced-Order Model for the Minimal Flow Unit
A β-VAE-GAN plus sensor-conditioned Transformer with Easy Attention forecasts near-wall turbulence in the Minimal Flow Unit, recovering 87% turbulent kinetic energy in 4D latent space and maintaining accuracy over 17288 t+ from 128 t+ initialization while reconstructing 82% TKE end-to-end.
-
Self-Consistent Generative Paths via Admissible Random Variational Transport
Defines generative probability paths as self-consistent when they form random fixed points of admissible variational transport operators and derives associated existence, attraction, and residual bounds.
-
Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records
Generative models for synthetic EMRs match marginal distributions but fail to preserve subgroup structure, effect estimates, and dependency structure simultaneously on the PRIME-CVD cohort.
-
MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training
MaskAlign uses random token-subset alignment and pre-mask mixing to reduce diffusion models' reliance on complete clean-image token sets during representation alignment.
-
Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing
Introduces adaptive clip partitioning and anchor-based editing to preserve temporal structure in zero-shot video editing.
-
CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations
CoMetaPNS combines meta-learned neural surrogates with a continual Bayesian Gaussian Mixture Model to adapt cardiac electrophysiology simulations to new data while avoiding catastrophic forgetting.
-
AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens
AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.
-
TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation
TrioPose proposes a Triple-Stream Pose-Aware DiT with relational bias masks and spatial loss weighting to achieve SOTA pose-guided text-to-image results on multi-person benchmarks like Human-Art.
-
Testing Equality of Conditional Distributions via Generative Models
A generative-model-based test for equality of conditional distributions that uses cross-generation, an RKHS-indexed supremum statistic, and multiplier bootstrap, with claimed double robustness to generator errors.
-
MediEncoder: Nonlinear Representation Learning for High-Dimensional Causal Mediation Analysis
MediEncoder jointly learns nonlinear low-dimensional covariate and mediator representations via a coupled encoder-decoder with cross-factor network, then applies them in an efficient influence function estimator for natural direct and indirect effects.
-
Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions
A new benchmark for counterfactual epidemic prediction under dynamic interventions is generated from a real-data-calibrated agent-based model and used to compare causal inference methods.
-
CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors
CoFi-UCGen achieves both coarse- and fine-grained unsupervised conditional image generation by using bit-codes for structured latent space and hierarchical modulation in diffusion models.
-
Balancing Image Compression and Generation with Bootstrapped Tokenization
SelfBootTok decomposes image tokens into global and local groups via self-bootstrapped learning, enabling generators to use only global tokens for ~40% less computation and a new SOTA gFID of 1.56 with 64 tokens.
-
What Type of Inference is Active Inference?
EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.
-
ChannelTok: Efficient Flexible-Length Vision Tokenization
ChannelTok introduces channel-wise tokenization with stochastic tail-dropping to achieve rFID 2.92 on ImageNet at 8.6x faster decoding and 2.1x smaller size than prior flexible tokenizers.
-
scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation
scTranslation is a benchmark that assembles diverse single-cell multi-omics datasets, integrates existing translation models, and evaluates them under feature selection, data quality, and few-shot scenarios to identify performance factors.