hub Mixed citations

DDT: Decoupled diffusion Transformer

Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang · 2025 · arXiv 2504.05741

Mixed citation behavior. Most common role is background (50%).

15 Pith papers citing it

Background 50% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 3 method 1

citation-polarity summary

background 4 baseline 3 use method 1

representative citing papers

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Vision Foundation Models as Generalist Tokenizers for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Continuous Adversarial Flow Models

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

cs.CV · 2025-11-24 · conditional · novelty 6.0

DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.

Back to Basics: Let Denoising Generative Models Denoise

cs.CV · 2025-11-17 · unverdicted · novelty 6.0

Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.

Improved Mean Flows: On the Challenges of Fastforward Generative Models

cs.CV · 2025-12-01 · unverdicted · novelty 5.0

Improved MeanFlow (iMF) reaches 1.72 FID on ImageNet 256x256 with one function evaluation by reformulating the training objective as a regression on instantaneous velocity and treating guidance as flexible conditioning variables.

Motif-Video 2B: Technical Report

cs.CV · 2026-04-14 · unverdicted · novelty 4.0 · 2 refs

Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model with 7x fewer parameters and far less training data through shared cross-attention and a three-part backbone.

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

cs.CV · 2026-05-18

citing papers explorer

Showing 15 of 15 citing papers.

Coevolving Representations in Joint Image-Feature Diffusion cs.CV · 2026-04-19 · unverdicted · none · ref 46
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 68
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Vision Foundation Models as Generalist Tokenizers for Image Generation cs.CV · 2026-05-18 · unverdicted · none · ref 81
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers cs.CV · 2026-05-16 · unverdicted · none · ref 40
sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion cs.CV · 2026-05-15 · unverdicted · none · ref 34
HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 85
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 75
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation cs.CV · 2025-11-24 · conditional · none · ref 62
DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.
Back to Basics: Let Denoising Generative Models Denoise cs.CV · 2025-11-17 · unverdicted · none · ref 71
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion cs.CV · 2026-05-18 · unverdicted · none · ref 28
FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV · 2026-05-07 · unverdicted · none · ref 67
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation cs.LG · 2026-04-14 · unverdicted · none · ref 58
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
Improved Mean Flows: On the Challenges of Fastforward Generative Models cs.CV · 2025-12-01 · unverdicted · none · ref 53
Improved MeanFlow (iMF) reaches 1.72 FID on ImageNet 256x256 with one function evaluation by reformulating the training objective as a regression on instantaneous velocity and treating guidance as flexible conditioning variables.
Motif-Video 2B: Technical Report cs.CV · 2026-04-14 · unverdicted · none · ref 40 · 2 links
Motif-Video 2B reaches 83.76% on VBench, outperforming a 14B-parameter model with 7x fewer parameters and far less training data through shared cross-attention and a three-part backbone.
SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation cs.CV · 2026-05-18 · unreviewed · ref 31

DDT: Decoupled diffusion Transformer

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer