TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
hub
Lotus: Diffusion-based visual foundation model for high-quality dense prediction
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 13roles
background 3polarities
background 3representative citing papers
CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.
Camera-Relative Object Coordinates (CROCS) as an intermediate geometry representation in two-stage image-to-3D models yields superior novel-view quality, geometric accuracy, and multiview consistency over depth maps, visual features, and other pointmap alternatives.
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
Marigold-SSD delivers zero-shot depth completion via single-step diffusion with late fusion, achieving fast inference after only 4.5 GPU days of training while showing strong cross-domain results on indoor and outdoor benchmarks.
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.
citing papers explorer
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation
CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.
-
How to Spin an Object: First, Get the Shape Right
Camera-Relative Object Coordinates (CROCS) as an intermediate geometry representation in two-stage image-to-3D models yields superior novel-view quality, geometric accuracy, and multiview consistency over depth maps, visual features, and other pointmap alternatives.
-
Open-Source Image Editing Models Are Zero-Shot Vision Learners
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
-
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
-
Diffusion Model as a Generalist Segmentation Learner
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
-
Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion
Marigold-SSD delivers zero-shot depth completion via single-step diffusion with late fusion, achieving fast inference after only 4.5 GPU days of training while showing strong cross-domain results on indoor and outdoor benchmarks.
-
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
-
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
-
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
-
The Midas Touch for Metric Depth
MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
-
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.
- Image Generators are Generalist Vision Learners