Scalable Diffusion Models with Transformers

William Peebles, Saining Xie

13 Pith papers cite this work.

representative citing papers

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

cs.CV · 2026-02-05 · unverdicted · novelty 7.0

DisCa replaces heuristic feature caching with a lightweight learnable neural predictor that is compatible with distillation, achieving 11.8× acceleration on video diffusion transformers while preserving generation quality.

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

HierEdit enables efficient 4K image editing via low-resolution proxy localization followed by hierarchical local-window diffusion that reuses unaltered regions as conditioning.

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

MaTe proposes a training-free diffusion transformer that performs material transfer using only images by integrating them at the token level for unified multi-modal attention in a shared latent space.

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency (by up to 5.6x) in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.

Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while scaling to text-to-image tasks.

Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

cs.GR · 2026-03-25 · conditional · novelty 6.0

Realiz3D decouples the visual domain from 3D controls in diffusion models via domain-aware residual adapters, enabling photorealistic controllable generation.

SegviGen: Repurposing 3D Generative Model for Part Segmentation

cs.CV · 2026-03-17 · unverdicted · novelty 6.0

SegviGen shows that pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, outperforming prior methods by 40% on interactive segmentation and by 15% on full segmentation while using only 0.32% of the labeled data.

Native and Compact Structured Latents for 3D Generation

cs.CV · 2025-12-16 · unverdicted · novelty 6.0

Introduces O-Voxel omni-voxel representation and Sparse Compression VAE for structured native 3D latents, enabling efficient training of large flow-matching models that produce higher-quality geometry and materials than prior methods.

FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers

cs.LG · 2025-12-08 · conditional · novelty 6.0

FlowLPS perturbs flow-model estimates with Langevin steps and then applies proximal refinement to balance fidelity and perceptual quality on linear inverse problems.

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

cs.CV · 2025-11-24 · conditional · novelty 6.0

DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

A two-stage method first predicts an intermediate Canny map for structure and then renders the image conditioned on appearance and structure; paired with a 100k text-aware dataset, it improves detail preservation in subject-driven generation.

AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

cs.CV · 2026-04-17 · unverdicted · novelty 4.0

Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.

citing papers explorer

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching cs.CV · 2026-02-05 · unverdicted · none · ref 49
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing cs.CV · 2026-05-17 · unverdicted · none · ref 38
MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer cs.CV · 2026-05-15 · unverdicted · none · ref 31
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models cs.CV · 2026-05-03 · unverdicted · none · ref 35
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation cs.CV · 2026-04-21 · unverdicted · none · ref 44
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning cs.GR · 2026-03-25 · conditional · none · ref 26
SegviGen: Repurposing 3D Generative Model for Part Segmentation cs.CV · 2026-03-17 · unverdicted · none · ref 48
Native and Compact Structured Latents for 3D Generation cs.CV · 2025-12-16 · unverdicted · none · ref 49
FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers cs.LG · 2025-12-08 · conditional · none · ref 20
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation cs.CV · 2025-11-24 · conditional · none · ref 43
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes cs.CV · 2026-05-22 · unverdicted · none · ref 45
Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction cs.CV · 2026-05-20 · unverdicted · none · ref 19
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations cs.CV · 2026-04-17 · unverdicted · none · ref 39
