Lotus: Diffusion-based visual foundation model for high-quality dense prediction

Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Zhang, Bingbing Liu, Ying-Cong Chen · 2025 · arXiv 2409.18124

20 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.

CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.

How to Spin an Object: First, Get the Shape Right

cs.CV · 2024-12-13 · unverdicted · novelty 7.0

Camera-Relative Object Coordinates (CROCS) as an intermediate geometry representation in two-stage image-to-3D models yields superior novel-view quality, geometric accuracy, and multiview consistency over depth maps, visual features, and other pointmap alternatives.

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.

UniGP: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

UniGP unifies controllable generation and dense prediction in an MMDiT-based diffusion model through simple joint training that preserves backbone priors.

AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.

Modality Forcing for Scalable Spatial Generation

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.

Open-Source Image Editing Models Are Zero-Shot Vision Learners

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

cs.CV · 2026-05-01 · unverdicted · novelty 6.0

UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

Diffusion Model as a Generalist Segmentation Learner

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.

Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion

cs.CV · 2026-03-11 · unverdicted · novelty 6.0

Marigold-SSD delivers zero-shot depth completion via single-step diffusion with late fusion, achieving fast inference after only 4.5 GPU days of training while showing strong cross-domain results on indoor and outdoor benchmarks.

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

cs.CV · 2026-02-08 · unverdicted · novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

cs.CV · 2025-11-30 · unverdicted · novelty 6.0

Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.

Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

cs.CV · 2025-08-20 · unverdicted · novelty 6.0

Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.

Towards Consistent Video Geometry Estimation

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.

The Midas Touch for Metric Depth

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

cs.CV · 2025-01-05 · unverdicted · novelty 5.0

DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · 2 refs

citing papers explorer

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 21
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
MUSE: Unlocking Timestep as Native Task Steering for One-Step Dense Prediction cs.CV · 2026-06-29 · unverdicted · none · ref 10
MUSE shows that the native timestep embedding in diffusion models acts as a parameter-free steering signal for multi-task monocular depth and normal estimation via manifold decoupling in latent space.
CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation cs.CV · 2026-04-13 · unverdicted · none · ref 23
CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.
How to Spin an Object: First, Get the Shape Right cs.CV · 2024-12-13 · unverdicted · none · ref 14
Camera-Relative Object Coordinates (CROCS) as an intermediate geometry representation in two-stage image-to-3D models yields superior novel-view quality, geometric accuracy, and multiview consistency over depth maps, visual features, and other pointmap alternatives.
PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation cs.CV · 2026-07-02 · unverdicted · none · ref 3
PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.
UniGP: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception cs.CV · 2026-06-29 · unverdicted · none · ref 9
UniGP unifies controllable generation and dense prediction in an MMDiT-based diffusion model through simple joint training that preserves backbone priors.
AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World cs.CV · 2026-06-29 · unverdicted · none · ref 26
AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.
Modality Forcing for Scalable Spatial Generation cs.CV · 2026-06-11 · unverdicted · none · ref 17
Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
Open-Source Image Editing Models Are Zero-Shot Vision Learners cs.CV · 2026-05-06 · unverdicted · none · ref 19
Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors cs.CV · 2026-05-01 · unverdicted · none · ref 42
UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
Diffusion Model as a Generalist Segmentation Learner cs.CV · 2026-04-27 · unverdicted · none · ref 34
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion cs.CV · 2026-03-11 · unverdicted · none · ref 22
Marigold-SSD delivers zero-shot depth completion via single-step diffusion with late fusion, achieving fast inference after only 4.5 GPU days of training while showing strong cross-domain results on indoor and outdoor benchmarks.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 32
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model cs.CV · 2025-11-30 · unverdicted · none · ref 22
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering cs.CV · 2025-08-20 · unverdicted · none · ref 21
Ouroboros uses two single-step diffusion models with cycle consistency for forward and inverse rendering, extending intrinsic decomposition to indoor/outdoor scenes with faster inference than multi-step methods.
VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching cs.CV · 2026-05-29 · unverdicted · none · ref 26
VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.
Towards Consistent Video Geometry Estimation cs.CV · 2026-05-28 · unverdicted · none · ref 23
ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.
The Midas Touch for Metric Depth cs.CV · 2026-05-12 · unverdicted · none · ref 20
MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation cs.CV · 2025-01-05 · unverdicted · none · ref 20
DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.
