TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
hub
Lotus: Diffusion-based visual foundation model for high-quality dense prediction
20 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 20roles
background 3polarities
background 3representative citing papers
MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.
CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.
Camera-Relative Object Coordinates (CROCS) as an intermediate geometry representation in two-stage image-to-3D models yields superior novel-view quality, geometric accuracy, and multiview consistency over depth maps, visual features, and other pointmap alternatives.
PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.
UniGP unifies controllable generation and dense prediction in an MMDiT-based diffusion model through simple joint training that preserves backbone priors.
AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.
Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
Marigold-SSD delivers zero-shot depth completion via single-step diffusion with late fusion, achieving fast inference after only 4.5 GPU days of training while showing strong cross-domain results on indoor and outdoor benchmarks.
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.
ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.
MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.
citing papers explorer
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction
MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.
-
CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation
CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.
-
How to Spin an Object: First, Get the Shape Right
Camera-Relative Object Coordinates (CROCS) as an intermediate geometry representation in two-stage image-to-3D models yields superior novel-view quality, geometric accuracy, and multiview consistency over depth maps, visual features, and other pointmap alternatives.
-
PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation
PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.
-
UniGP: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception
UniGP unifies controllable generation and dense prediction in an MMDiT-based diffusion model through simple joint training that preserves backbone priors.
-
AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World
AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.
-
Modality Forcing for Scalable Spatial Generation
Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
-
Open-Source Image Editing Models Are Zero-Shot Vision Learners
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
-
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
-
Diffusion Model as a Generalist Segmentation Learner
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
-
Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion
Marigold-SSD delivers zero-shot depth completion via single-step diffusion with late fusion, achieving fast inference after only 4.5 GPU days of training while showing strong cross-domain results on indoor and outdoor benchmarks.
-
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
-
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
-
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
-
VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching
VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.
-
Towards Consistent Video Geometry Estimation
ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.
-
The Midas Touch for Metric Depth
MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
-
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.