VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
hub Canonical reference
Scaling recti- fied flow transformers for high-resolution image synthesis
Canonical reference. 88% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.
ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garment support.
Face2Scene uses facial restoration as an oracle to derive degradation codes that condition a diffusion model for restoring the entire degraded scene.
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
LivingSwap is the first video reference-guided face swapping model that uses keyframe conditioning and temporal stitching to preserve source video realism with high fidelity across long sequences.
One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
ISTS watermarking dynamically controls injection based on prompt semantics and uses two-sided detection to resist removal and forgery attacks in diffusion models.
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
Realiz3D decouples visual domain from 3D controls in diffusion models via domain-aware residual adapters to enable photorealistic controllable generation.
MMHNet enables video-to-audio models trained on short clips to generalize and generate audio for videos over 5 minutes long.
GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.
citing papers explorer
-
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
-
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming baselines on a new PAd1M dataset.
-
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
-
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
YOSE accelerates DiT video object removal up to 2.5x by using BVI for adaptive token selection and DiffSim to simulate unmasked token effects, while preserving visual quality.
-
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
-
OneHOI: Unifying Human-Object Interaction Generation and Editing
OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
-
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
-
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
-
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
-
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garment support.
-
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
Face2Scene uses facial restoration as an oracle to derive degradation codes that condition a diffusion model for restoring the entire degraded scene.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
ATATA: One Algorithm to Align Them All
ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
-
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.
-
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
LivingSwap is the first video reference-guided face swapping model that uses keyframe conditioning and temporal stitching to preserve source video realism with high fidelity across long sequences.
-
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
One-to-All Animation enables alignment-free character animation and image pose transfer via self-supervised outpainting reformulation, reference extraction, hybrid fusion attention, identity-robust pose control, and token replacement for long videos.
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.
-
Repurposing 3D Generative Model for Autoregressive Layout Generation
LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.
-
Towards Robust Content Watermarking Against Removal and Forgery Attacks
ISTS watermarking dynamically controls injection based on prompt semantics and uses two-sided detection to resist removal and forgery attacks in diffusion models.
-
MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
-
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
Realiz3D decouples visual domain from 3D controls in diffusion models via domain-aware residual adapters to enable photorealistic controllable generation.
-
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
MMHNet enables video-to-audio models trained on short clips to generalize and generate audio for videos over 5 minutes long.
-
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
GaussianDWM uses 3D Gaussians with embedded linguistic features, language-guided sampling, and dual-condition generation for unified scene understanding and multi-modal output in driving world models.
-
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
KeyTailor improves video virtual try-on realism by using instruction-guided keyframes to enhance garment details and background integrity in DiT models without major architectural changes.
-
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
FlexAvatar introduces bias sinks in a transformer to unify monocular and multi-view training, yielding complete 3D head avatars with strong generalization and view extrapolation from single images.
-
From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images
A technique reconstructs large urban areas from sparse extreme off-nadir satellite images by modeling geometry as a Z-monotonic 2.5D height map SDF and applying a generative network to restore plausible textures on the resulting mesh.
-
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
-
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrative dynamism in sequential image generation.
-
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
FRAMER improves real-world super-resolution by decomposing features into low- and high-frequency bands via FFT, applying intra- and inter-contrastive losses with adaptive modulators, and using the final layer as teacher for intermediate layers during diffusion denoising.
-
Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
Instance-level sampling schedules optimized via REINFORCE with James-Stein estimator improve text-to-image alignment and allow 5-step Flux generation to match deliberately distilled samplers.
-
simpleposter: a simple baseline for product poster generation
SimplePoster achieves 98.7% subject preservation and improved text accuracy in product posters via full-parameter fine-tuning of an inpainting model and zero-cost character-level position encoding, outperforming complex baselines like SeedEdit 3.0.
- LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models