Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (2)
Representative citing papers
PixArt-α matches commercial text-to-image quality with a diffusion transformer trained in 675 A100 GPU days through decomposed training stages, cross-attention text injection, and vision-language model dense captions.
citing papers explorer
- Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
  Depth Pro is a fast foundation model for zero-shot metric monocular depth estimation that produces sharp high-resolution depth maps with absolute scale using a multi-scale vision transformer.
- PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
  PixArt-α matches commercial text-to-image quality with a diffusion transformer trained in 675 A100 GPU days through decomposed training stages, cross-attention text injection, and vision-language model dense captions.