Scaling rectified flow transformers for high-resolution image synthesis
8 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
  Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
- TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
  TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
- Stitched Value Model for Diffusion Alignment
  StitchVM stitches clean-image reward models with diffusion backbones to enable efficient value estimation for noisy latents, speeding up diffusion alignment methods like DPS by 3.2x and halving memory.
- PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution
  PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.
- Generate Your Talking Avatar from Video Reference
  TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.
- Nucleus-Image: Sparse MoE for Image Generation
  A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
- JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
  JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.