Pixel-perfect depth with semantics-prompted diffusion transformers. arXiv preprint arXiv:2510.07316, 2025

Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. · 2025 · arXiv 2510.07316

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

SurGe: Improved Surface Geometry in Point Maps

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

SurGe improves local surface geometry in feedforward point maps via gradient matching loss and Neighborhood Attention Decoder, topping average rank on eight zero-shot monocular geometry benchmarks for global AbsRel while boosting local metrics.

Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

cs.CV · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.

Modality Forcing for Scalable Spatial Generation

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

MDA represents per-pixel depth as a mixture of distributions so that boundary pixels can align hypotheses with distinct surfaces instead of averaging into empty space.

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

cs.CV · 2026-05-11 · unverdicted · novelty 6.0 · 4 refs

GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.

WildPose: A Unified Framework for Robust Pose Estimation in the Wild

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.

citing papers explorer

SurGe: Improved Surface Geometry in Point Maps · cs.CV · 2026-05-29 · unverdicted · none · ref 58
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models · cs.CV · 2026-05-03 · unverdicted · none · ref 49 · 2 links
Modality Forcing for Scalable Spatial Generation · cs.CV · 2026-06-11 · unverdicted · none · ref 48
Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation · cs.CV · 2026-06-01 · unverdicted · none · ref 31
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth · cs.CV · 2026-05-11 · unverdicted · none · ref 14 · 4 links
VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching · cs.CV · 2026-05-29 · unverdicted · none · ref 87
WildPose: A Unified Framework for Robust Pose Estimation in the Wild · cs.CV · 2026-05-12 · unverdicted · none · ref 52
