Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. · 2024

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.

Stitched Value Model for Diffusion Alignment

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

StitchVM stitches clean-image reward models with diffusion backbones to enable efficient value estimation for noisy latents, speeding up diffusion alignment methods like DPS by 3.2x and halving memory.

PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.

Generate Your Talking Avatar from Video Reference

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.

Nucleus-Image: Sparse MoE for Image Generation

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

cs.GR · 2026-05-05 · unverdicted · novelty 4.0 · 2 refs

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

citing papers explorer

Showing 8 of 8 citing papers.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning · cs.CV · 2026-05-20 · unverdicted · none · ref 20 · 2 links
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment · cs.LG · 2026-05-09 · unverdicted · none · ref 2 · 2 links
Stitched Value Model for Diffusion Alignment · cs.CV · 2026-05-19 · unverdicted · none · ref 67
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution · cs.CV · 2026-05-13 · unverdicted · none · ref 27
Generate Your Talking Avatar from Video Reference · cs.CV · 2026-04-30 · unverdicted · none · ref 15
Nucleus-Image: Sparse MoE for Image Generation · cs.CV · 2026-04-14 · unverdicted · none · ref 27
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation · cs.GR · 2026-05-05 · unverdicted · none · ref 30 · 2 links
Show-o2: Improved Native Unified Multimodal Models · cs.CV · 2025-06-18 · unverdicted · none · ref 37
