Pixel-perfect depth with semantics-prompted diffusion transformers. arXiv preprint arXiv:2510.07316, 2025a
7 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV 7
years: 2026 7
verdicts: UNVERDICTED 7
roles: dataset 1
polarities: use dataset 1
citing papers explorer
- SurGe: Improved Surface Geometry in Point Maps
  SurGe improves local surface geometry in feedforward point maps via a gradient matching loss and a Neighborhood Attention Decoder, achieving the best average rank for global AbsRel across eight zero-shot monocular geometry benchmarks while also improving local metrics.
- Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
  M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
- Modality Forcing for Scalable Spatial Generation
  Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
- Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation
  MDA represents per-pixel depth as a mixture of distributions so that boundary pixels can align hypotheses with distinct surfaces instead of averaging into empty space.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
  GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.
- VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching
  VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.
- WildPose: A Unified Framework for Robust Pose Estimation in the Wild
  WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.