super hub Mixed citations

Flow Matching for Generative Modeling

Heli Ben-Hamu, Maximilian Nickel, Ricky TQ Chen, Yaron Lipman · 2022 · cs.LG · arXiv 2210.02747

Mixed citation behavior. Most common role is method (47%).

605 Pith papers citing it

Method 47% of classified citations

open full Pith review browse 605 citing papers more from Heli Ben-Hamu arXiv PDF

abstract

We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 79 method 74 baseline 5

citation-polarity summary

use method 74 background 71 unclear 6 baseline 5 support 2

claims ledger

abstract We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more

authors

and Matt Le Heli Ben-Hamu Maximilian Nickel Ricky TQ Chen Yaron Lipman

co-cited works

representative citing papers

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Data geometry makes time identifiable from noisy interpolants at rate O(1/sqrt(d-k)), rendering the time-blindness gap asymptotically negligible relative to coupling variance.

Generative Modeling with Flux Matching

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Flux Matching generalizes score-based generative modeling by using a weaker objective that admits infinitely many non-conservative vector fields with the data as stationary distribution, enabling new design choices beyond traditional score matching.

A-CODE: Fully Atomic Protein Co-Design with Unified Multimodal Diffusion

q-bio.QM · 2026-05-05 · unverdicted · novelty 8.0

A-CODE presents a fully atomic one-stage multimodal diffusion model for protein co-design that claims superior unconditional generation performance over prior one- and two-stage models plus a tenfold success-rate gain on hard binder-design tasks.

Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching

cs.LG · 2026-05-01 · unverdicted · novelty 8.0 · 3 refs

Derives closed-form posterior covariance for flow matching from divergence of velocity field, enabling post-hoc uncertainty on pre-trained models including one-step generators.

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

cs.LG · 2026-04-29 · unverdicted · novelty 8.0 · 3 refs

FMRG reformulates guidance as deterministic optimal control, deriving a single-trajectory method using the flow map that matches or exceeds baselines on reward-guided generation and inverse problems with 3 NFEs at text-to-image scale.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

Query Lower Bounds for Diffusion Sampling

cs.LG · 2026-04-12 · unverdicted · novelty 8.0

Diffusion sampling from d-dimensional distributions requires at least ~sqrt(d) adaptive score queries when score estimates have polynomial accuracy.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Generative models on phase space

hep-ph · 2026-04-02 · unverdicted · novelty 8.0

Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

cs.CV · 2026-03-30 · unverdicted · novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Building Normalizing Flows with Stochastic Interpolants

cs.LG · 2022-09-30 · conditional · novelty 8.0 · 2 refs

Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

DVG-WM disentangles dynamics learning and visual synthesis in video world models using flow matching and latent degradation to achieve faster inference up to 3.97 times with improved quality on LIBERO and real-world robotic platforms.

Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

Introduces a Bridge latent interface that maps mismatched student latents into teacher space, enabling distillation from modern diffusion teachers to compact one-step students and raising SD 1.5 HPSv3 from 5.4 to 9.4 while keeping one-step speed.

OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

OOPSIEVERSE is a new damage-aware simulation benchmark for household robot manipulation that converts contact, thermal, and fluid signals into task-agnostic damage metrics and demonstrates uses in safer policy learning and benchmarking.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.

A Distributionally Robust Framework for Learned Reconstructions in Inverse Problems

math.OC · 2026-06-29 · unverdicted · novelty 7.0

Introduces structured DRO for learned inverse problem reconstructions with ambiguity sets aligned to the forward operator, yielding explicit dual representations and a worst-case bound that induces Tikhonov regularization on the operator Lipschitz constant.

SurgVLA-Bench: Towards Evaluating Vision-Language-Action Models for Laparoscopic Surgical Robotics

cs.AI · 2026-06-28 · unverdicted · novelty 7.0

SurgVLA-Bench supplies a hierarchical task taxonomy and multi-dimensional evaluation framework for VLA models in laparoscopic robotics simulation, showing autoregressive models excel at semantics while flow-matching models achieve higher precision but all fall short due to endoscopic view constraint

CORDEX-ML-Bench: A Benchmark for Data-Driven Regional Climate Downscaling -Experiment Design and Overview

physics.ao-ph · 2026-06-28 · unverdicted · novelty 7.0

CORDEX-ML-Bench benchmarks 40 ML models for climate downscaling and finds generative models outperform deterministic ones on precipitation while historically trained models underestimate future climate signals.

Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

cs.AI · 2026-06-28 · conditional · novelty 7.0

Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.

Masked Diffusion Decoding as $x$-Prediction Flow

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

Masked diffusion LMs can use continuous x-prediction flow with token-wise asynchronous updates and an RL policy network to reach 97% performance on HumanEval using only 25% of the usual decoding budget.

citing papers explorer

Showing 36 of 36 citing papers after filters.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 48 · internal anchor
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV · 2026-04-05 · unverdicted · none · ref 19 · internal anchor
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 51 · internal anchor
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models cs.CV · 2026-05-09 · unverdicted · none · ref 20 · 2 links · internal anchor
ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 31 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV · 2026-04-15 · unverdicted · none · ref 27 · internal anchor
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 43 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Novel View Synthesis as Video Completion cs.CV · 2026-04-09 · unverdicted · none · ref 23 · internal anchor
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 48 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 39 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control cs.CV · 2026-04-07 · unverdicted · none · ref 25 · internal anchor
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos cs.CV · 2026-01-15 · unverdicted · none · ref 50 · internal anchor
CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning cs.CV · 2026-05-12 · unverdicted · none · ref 25 · internal anchor
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments cs.CV · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
Introduces GAP datasets of synthetic heavily eroded irregular puzzle pieces modeled on archaeological fragments and PuzzleFlow, a ViT and flow-matching framework that outperforms prior jigsaw solvers on these datasets.
Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 18 · 2 links · internal anchor
Unison presents a unified audio-video generation model that decouples speech and sound effects while using bidirectional forcing to synchronize with motion, claiming SOTA perceptual quality and alignment.
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CV · 2026-05-03 · unverdicted · none · ref 25 · internal anchor
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.
Embody4D: A Generalist Data Engine for Embodied 4D World Modeling cs.CV · 2026-05-03 · unverdicted · none · ref 32 · 2 links · internal anchor
Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.
Lyra 2.0: Explorable Generative 3D Worlds cs.CV · 2026-04-14 · unverdicted · none · ref 61 · internal anchor
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 40 · internal anchor
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks cs.CV · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit--Explicit Optimization cs.CV · 2026-04-11 · unverdicted · none · ref 19 · internal anchor
PhyMix unifies a new multi-aspect physics evaluator with implicit policy optimization and explicit test-time correction to produce single-image 3D indoor scenes that are both visually faithful and physically plausible.
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks cs.CV · 2026-04-09 · unverdicted · none · ref 19 · internal anchor
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
LPM 1.0: Video-based Character Performance Model cs.CV · 2026-04-09 · unverdicted · none · ref 61 · internal anchor
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 61 · internal anchor
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 56 · internal anchor
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
DanceGRPO: Unleashing GRPO on Visual Generation cs.CV · 2025-05-12 · unverdicted · none · ref 5 · internal anchor
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
Improving Video Generation with Human Feedback cs.CV · 2025-01-23 · unverdicted · none · ref 45 · internal anchor
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows cs.CV · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
PixelFlowCast delivers high-fidelity precipitation nowcasts from radar sequences using a latent-free Pixel Mean Flows predictor guided by a deterministic coarse stage and KANCondNet features.
Pose-Aware Diffusion for 3D Generation cs.CV · 2026-05-01 · unverdicted · none · ref 24 · internal anchor
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation cs.CV · 2026-04-20 · unverdicted · none · ref 12 · internal anchor
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset cs.CV · 2025-05-14 · conditional · none · ref 17 · internal anchor
BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
Open-Sora: Democratizing Efficient Video Production for All cs.CV · 2024-12-29 · unverdicted · none · ref 18 · internal anchor
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
Seedream 3.0 Technical Report cs.CV · 2025-04-15 · unverdicted · none · ref 12 · internal anchor
Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware timestep sampling for 4-8x faster inference at up to 2K resolution.
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation cs.CV · 2025-01-21 · unverdicted · none · ref 53 · internal anchor
Hunyuan3D 2.0 scales flow-based diffusion transformers and texture synthesis models to generate high-resolution textured 3D assets that outperform prior state-of-the-art in geometry, alignment, and texture quality.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV · 2026-04-19 · unreviewed · ref 48 · 2 links · internal anchor
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unreviewed · ref 44 · internal anchor

Flow Matching for Generative Modeling

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer