
arxiv: 2606.00386 · v1 · pith:EXMW2KJAnew · submitted 2026-05-29 · 💻 cs.CV

αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion

Pith reviewed 2026-06-28 22:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords stereo conversion · soft boundary decomposition · layered depth · alpha representation · depth estimation · computer vision · image matting

The pith

αDepth decomposes soft boundaries via layered color and depth estimates plus circular alpha representation for accurate stereo conversion without manual guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the ambiguity at soft boundaries such as hair and defocus blur during stereo conversion from single images. It proposes estimating separate layered color and depth values at those boundaries instead of single-layer depth predictions. To manage complex scenes with multiple targets, the method introduces Circular Alpha Representation that performs local boundary decomposition rather than global foreground extraction. This design supports single-pass, automatic scene-level inference. If correct, the result is stereo output free of background bleeding and structural distortions at fuzzy edges.

Core claim

αDepth is a layered representation that resolves mixed color and depth ambiguity by estimating layered color and depth values at soft boundaries and employs Circular Alpha Representation (CAR) to shift from global target extraction to local boundary decomposition, enabling efficient scene-level inference without manual guidance.
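To make the layered idea concrete, here is a minimal sketch on a single scanline, assuming the standard compositing model I = αF + (1 − α)B and integer per-layer disparities. The function names (`composite`, `shift_right`, `layered_warp`) are hypothetical and not taken from the paper, and padding revealed background with its edge value is a simplification; it is not the paper's warping or inpainting procedure.

```python
def composite(alpha, fg, bg):
    """Standard alpha compositing per pixel: I = alpha * F + (1 - alpha) * B."""
    return [a * f + (1.0 - a) * b for a, f, b in zip(alpha, fg, bg)]


def shift_right(row, d, fill):
    """Shift a scanline right by a non-negative integer disparity d, padding with `fill`."""
    if d < 0:
        raise ValueError("this sketch only handles non-negative disparities")
    d = min(d, len(row))
    return [fill] * d + list(row[:len(row) - d])


def layered_warp(alpha, fg, bg, d_fg, d_bg):
    """Warp foreground (with its alpha) and background separately, then re-composite.

    Assumption: each layer moves rigidly by one disparity. Because the
    foreground colour travels with its own alpha, no background colour is
    dragged along with the soft edge.
    """
    alpha_w = shift_right(alpha, d_fg, 0.0)
    fg_w = shift_right(fg, d_fg, 0.0)
    bg_w = shift_right(bg, d_bg, bg[0])  # edge padding stands in for inpainting
    return composite(alpha_w, fg_w, bg_w)


# A soft-edged bright foreground over a dark, static background.
alpha = [0.0, 0.5, 1.0, 0.5, 0.0, 0.0]
fg = [1.0] * 6
bg = [0.2] * 6
warped = layered_warp(alpha, fg, bg, d_fg=2, d_bg=0)
```

In this toy case the soft edge keeps its 0.5 opacity after the shift and blends with the background found at the new location, which is the behaviour a single depth value per pixel cannot express.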

What carries the argument

Circular Alpha Representation (CAR), a local boundary decomposition method that replaces global target extraction to support multi-target scene inference.
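The figure captions on this page say only that CAR encodes alpha into a continuous trigonometric space (labels αsin, αcos) and decodes the predicted components at inference; the exact mapping is not given here. The sketch below is therefore an assumption: it maps α to an angle θ = 2πα and back via `atan2`. The names `car_encode` and `car_decode` are hypothetical.

```python
import math


def car_encode(alpha):
    """Assumed encoding: alpha in [0, 1] -> (sin, cos) of theta = 2*pi*alpha."""
    theta = 2.0 * math.pi * alpha
    return math.sin(theta), math.cos(theta)


def car_decode(a_sin, a_cos):
    """Assumed decoding: recover alpha from the angle of the (cos, sin) vector."""
    theta = math.atan2(a_sin, a_cos)  # in (-pi, pi]
    return (theta % (2.0 * math.pi)) / (2.0 * math.pi)


# Under this assumed mapping, alpha = 0 and alpha = 1 land on the same point
# of the circle, so a jump between them is continuous in the encoded space.
s0, c0 = car_encode(0.0)
s1, c1 = car_encode(1.0)
round_trip = [car_decode(*car_encode(a)) for a in (0.1, 0.5, 0.9)]
```

One consequence of this particular mapping is that fully transparent and fully opaque pixels share a code, so something else would have to tell them apart. Whether and how the paper handles this is not stated on this page.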

If this is right

  • Eliminates background bleeding at soft boundaries in stereo output.
  • Removes structural distortions at soft boundaries.
  • Supports single-pass processing in complex multi-target scenes.
  • Removes the need for user intervention required by prior matting methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The local decomposition approach could extend to temporal consistency checks in video stereo conversion.
  • Layered boundary handling may improve related tasks such as instance-aware image synthesis.
  • Replacing global matting with boundary-focused representations might apply to other vision problems involving transparency.

Load-bearing premise

That estimating layered color and depth values combined with CAR enables accurate scene-level inference without manual guidance even in complex multi-target scenes.

What would settle it

Stereo conversion results on a test set of complex multi-object scenes that still exhibit background bleeding or structural distortions at soft boundaries.

Figures

Figures reproduced from arXiv: 2606.00386 by Christopher Schroers, Karlis Martins Briedis, Lukas Mehl, Markus Gross, Xiang Zhang, Yang Zhang.

Figure 1
Figure 1: Layered αDepth Representation. We introduce αDepth to decompose soft boundaries (e.g., hair, thin structures, and defocus blur) for high-fidelity stereo conversion. Given an image and its depth map as inputs, our approach estimates layered information, i.e., alpha, foreground/background (FG/BG) colors and depths, at local soft boundaries (see non-zero alpha regions), enabling scene-level inference of multi…
Figure 2
Figure 2: Comparison with existing paradigms. Depth estimation models typically assign a single depth value per pixel, struggling with mixed colors at soft boundaries and suffering from depth ambiguity. While conventional matting approaches extract instance-level soft boundaries, they usually require manual guidance (e.g., trimaps). In contrast, our layered αDepth representation enables automatic scene-level decompo…
Figure 3
Figure 3: Challenges of soft boundary recovery in stereo conversion. (a) We evaluate warping performance via Epipolar Plane Images (EPIs) extracted along the gray dashed line under uniform rightward camera motion. Direct warping with Video Depth Anything [3] struggles with depth ambiguity at soft boundaries, causing broken edges and flying pixels. Although HairGuard [55] captures better details, its single-layer dep…
Figure 4
Figure 4: αDepth estimation pipeline. Given an image and its corresponding depth map (e.g., from a pre-trained depth model), we employ a dual-path encoder to extract both semantic and detail features. A multi-branch decoder then processes these features for task-specific predictions. Finally, we apply circular alpha decoding to generate the estimated alpha map, which subsequently modulates and constrains the layered…
Figure 5
Figure 5: Circular Alpha Representation (CAR). The vanilla alpha representation inherently suffers from sharp discontinuities at the intersecting boundaries of multiple overlapping instances. By contrast, CAR encodes the ground-truth alpha into continuous trigonometric space during training, benefiting model optimization and eliminating alpha valley issues (Fig. 3b). During inference, the predicted trigonometric com…
Figure 6
Figure 6: Training Data Curation. Firstly, the alpha map is processed via circular alpha encoding to yield continuous alpha labels (αsin, αcos) and thresholded to produce layered masks (MFG, MBG). In layered color/depth generation, foreground and background assets are composited to form the synthesized input image (IIN) and depth (DIN). Concurrently, masked blending is applied to generate ground-truth color layers (…
Figure 7
Figure 7: Visual comparisons with HairGuard [55] in warping and stereo conversion.
Figure 8
Figure 8: Visual comparisons with alpha matting methods.
Figure 9
Figure 9: Stability comparisons between vanilla alpha representation and circular alpha representation (CAR). Vanilla alpha representation often suffers from alpha valley issues and thus produces unstable results. By contrast, our CAR shows consistent performance when processing video inputs. Regions outside α ∈ [0.02, 0.98] are masked out for better comparison.
Figure 10
Figure 10: Alpha estimation performance under different depth inputs. We generate input depth using state-of-the-art models, including Depth Anything V2 (DAv2) [45], Depth Pro (DPro) [2], Pixel-Perfect Depth (PPD) [42], and MoGe-2 [40]. Despite different characteristics exhibited in depth inputs, our αDepth shows stable performance in alpha estimation and soft boundary detail extraction.
Figure 11
Figure 11: Warping performance under large viewpoint changes. This example employs the camera motion (arc left with rotation) from ReCamMaster [1]. Due to depth ambiguity in soft boundary regions, the warping results using the original depth from Depth Anything V2 [45] often contain broken structures. Although HairGuard refines depth to better preserve soft boundary details [55], its results often suffer from backgr…
Figure 12
Figure 12: Visual comparisons of ablation models in Tab. 4.
Figure 13
Figure 13: Visualization of αDepth results on Marvel-10K dataset [55].
Figure 14
Figure 14: Visual comparisons in warping and stereo conversion, part one.
Figure 15
Figure 15: Visual comparisons in warping and stereo conversion, part two.
Figure 16
Figure 16: Visual comparisons with alpha matting methods.
Original abstract

Accurately modeling soft boundaries, e.g., hair and defocus blur, is a fundamental challenge in stereo conversion due to the ambiguous blending of foreground and background. Existing depth models primarily predict single-layer depth, leading to ambiguity in depth correspondence at soft boundaries. While matting techniques can capture opacity for layered modeling, they often struggle in complex scenes with multiple targets and usually require user intervention. This paper introduces αDepth, a layered representation that decomposes soft boundaries for high-fidelity stereo conversion. Specifically, we first resolve mixed color and depth ambiguity by estimating layered color and depth values at soft boundaries. Considering complex multi-target scenes, we design a Circular Alpha Representation (CAR) that shifts the paradigm from global target extraction to local boundary decomposition. Unlike prior matting methods restricted to a single foreground/background, CAR enables efficient scene-level inference without manual guidance. Extensive evaluations demonstrate that αDepth achieves state-of-the-art performance in stereo conversion, eliminating background bleeding and structural distortions at soft boundaries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces αDepth, a layered representation for stereo conversion that resolves mixed color and depth ambiguity at soft boundaries by estimating layered color and depth values. It also proposes Circular Alpha Representation (CAR), which shifts from global target extraction to local boundary decomposition and thereby enables guidance-free scene-level inference in multi-target scenes. The paper claims state-of-the-art performance and the elimination of background bleeding and structural distortions.

Significance. If the central claims hold, the work would offer a meaningful advance for stereo conversion by addressing soft-boundary artifacts without manual intervention, with potential impact on applications requiring accurate layered depth in complex scenes.

major comments (2)
  1. [Abstract] The claim that CAR 'enables efficient scene-level inference without manual guidance' in complex multi-target scenes rests on the untested premise that independent local decompositions compose without residual ambiguity or bleeding; no derivation or bound on composition error is supplied.
  2. [Abstract] Assertions of 'state-of-the-art performance' and 'eliminating background bleeding and structural distortions' are unsupported by any quantitative results, baselines, error metrics, or evaluation details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point-by-point below and indicate planned revisions to the abstract.

  1. Referee: [Abstract] The claim that CAR 'enables efficient scene-level inference without manual guidance' in complex multi-target scenes rests on the untested premise that independent local decompositions compose without residual ambiguity or bleeding; no derivation or bound on composition error is supplied.

    Authors: CAR is explicitly formulated for local boundary decomposition so that each soft boundary can be processed independently; the full manuscript shows through multi-target scene experiments that these local results compose into coherent scene-level layered representations without guidance. We agree, however, that the manuscript supplies no formal derivation or composition-error bound. We will revise the abstract to frame the claim as empirically validated rather than theoretically guaranteed. revision: partial

  2. Referee: [Abstract] Assertions of 'state-of-the-art performance' and 'eliminating background bleeding and structural distortions' are unsupported by any quantitative results, baselines, error metrics, or evaluation details.

    Authors: The abstract summarizes results that are quantified in the manuscript (comparisons to prior depth and matting baselines, PSNR/SSIM and perceptual metrics on stereo conversion, and visual ablation of bleeding/distortion). To address the concern that the abstract itself lacks supporting detail, we will revise it to reference the key quantitative findings and evaluation protocol. revision: yes
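The rebuttal names PSNR among the metrics. For reference, a minimal sketch of the standard PSNR definition on flattened pixel values follows, assuming intensities in [0, `peak`]. The function name and the flat-list input are illustrative choices; this is not the paper's evaluation code, and the page does not state the paper's evaluation protocol.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] signal gives MSE = 0.01.
value = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

A global score like this averages over the whole image, so artifacts confined to thin soft boundaries can move it only slightly, which is one reason visual comparisons are usually reported alongside it.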

Circularity Check

0 steps flagged

No circularity detectable; derivation chain not shown


The provided abstract and context contain no equations, derivations, parameter fits, or self-citations. CAR is introduced descriptively as a shift to local decomposition, but no mathematical steps are exhibited that could reduce a claimed prediction or uniqueness result to its own inputs by construction. With no quotable reductions available, the score is 0 and the list of flagged steps is empty.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only abstract provided; no free parameters, axioms, or invented entities can be extracted beyond the named CAR representation.

invented entities (1)
  • Circular Alpha Representation (CAR) · no independent evidence
    purpose: Shifts from global target extraction to local boundary decomposition for multi-target scenes
    New design element introduced to enable single-pass scene-level inference without manual guidance

pith-pipeline@v0.9.1-grok · 5718 in / 923 out tokens · 20071 ms · 2026-06-28T22:35:07.208469+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025

  2. [2]

    Depth pro: Sharp monocular metric depth in less than a second

    Alexey Bochkovskiy, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In ICLR, 2025

  3. [3]

    Video depth anything: Consistent depth estimation for super-long videos

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In CVPR, pages 22831–22840, 2025

  4. [4]

    Svg: 3d stereoscopic video generation via denoising frame matrix

    Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, and Yinda Zhang. Svg: 3d stereoscopic video generation via denoising frame matrix. arXiv preprint arXiv:2407.00367, 2024

  5. [5]

    Boosting robustness of image matting with context assembling and strong data augmentation

    Yutong Dai, Brian Price, He Zhang, and Chunhua Shen. Boosting robustness of image matting with context assembling and strong data augmentation. In CVPR, pages 11707–11716, 2022

  6. [6]

    Image quality assessment: Unifying structure and texture similarity

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE TPAMI, 44(5):2567–2581, 2022

  7. [7]

    Cat3d: Create anything in 3d with multi-view diffusion models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. In NeurIPS, 2024

  8. [8]

    Generative video matting

    Yongtao Ge, Kangyang Xie, Guangkai Xu, Li Ke, Mingyu Liu, Longtao Huang, Hui Xue, Hao Chen, and Chunhua Shen. Generative video matting. In SIGGRAPH, pages 1–10, 2025

  9. [9]

    Eye2eye: A simple approach for monocular-to-stereo video synthesis

    Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, and Noah Snavely. Eye2eye: A simple approach for monocular-to-stereo video synthesis. arXiv preprint arXiv:2505.00135, 2025

  10. [10]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, pages 9492–9502, 2024

  11. [11]

    Modnet: Real-time trimap-free portrait matting via objective decomposition

    Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Rynson WH Lau. Modnet: Real-time trimap-free portrait matting via objective decomposition. In AAAI, volume 36, pages 1140–1147, 2022

  12. [12]

    Zim: Zero-shot image matting for anything

    Beomyoung Kim, Chanyong Shin, Joonhyun Jeong, Hyungsik Jung, Se-Yun Lee, Sewhan Chun, Dong-Hyun Hwang, and Joonsang Yu. Zim: Zero-shot image matting for anything. In ICCV, pages 23828–23838, 2025

  13. [13]

    Matting anything

    Jiachen Li, Jitesh Jain, and Humphrey Shi. Matting anything. In CVPR, pages 1775–1785, 2024

  14. [14]

    End-to-end Animal Matting

    Jizhizi Li. End-to-end Animal Matting. PhD thesis, University of Sydney, 2020

  15. [15]

    Privacy-preserving portrait matting

    Jizhizi Li, Sihan Ma, Jing Zhang, and Dacheng Tao. Privacy-preserving portrait matting. In ACMMM, pages 3501–3509, 2021

  16. [16]

    Deep automatic natural image matting

    Jizhizi Li, Jing Zhang, and Dacheng Tao. Deep automatic natural image matting. In IJCAI. International Joint Conferences on Artificial Intelligence Organization, 2021

  17. [17]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, pages 2041–2050, 2018

  18. [18]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  19. [19]

    Robust high-resolution video matting with temporal guidance

    Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with temporal guidance. In WACV, pages 238–247, 2022

  20. [20]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, pages 22160–22169, 2024

  21. [21]

    What is the fractional laplacian? a comparative review with new results

    Anna Lischke, Guofei Pang, Mamikon Gulian, Fangying Song, Christian Glusa, Xiaoning Zheng, Zhiping Mao, Wei Cai, Mark M Meerschaert, Mark Ainsworth, et al. What is the fractional laplacian? a comparative review with new results. Journal of Computational Physics, 404:109009, 2020

  22. [22]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  23. [23]

    Stereo conversion with disparity-aware warping, compositing and inpainting

    Lukas Mehl, Andrés Bruhn, Markus Gross, and Christopher Schroers. Stereo conversion with disparity-aware warping, compositing and inpainting. In WACV, pages 4260–4269, 2024

  24. [24]

    Elastic3d: Controllable stereo video conversion with guided latent decoding

    Nando Metzger, Prune Truong, Goutam Bhat, Konrad Schindler, and Federico Tombari. Elastic3d: Controllable stereo video conversion with guided latent decoding. In CVPR, 2026

  25. [25]

    Softmax splatting for video frame interpolation

    Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In CVPR, pages 5437–5446, 2020

  26. [26]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  27. [27]

    Matteformer: Transformer-based image matting via prior-tokens

    GyuTae Park, SungJoon Son, JaeYoung Yoo, SeHo Kim, and Nojun Kwak. Matteformer: Transformer-based image matting via prior-tokens. In CVPR, pages 11696–11706, 2022

  28. [28]

    UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025

  29. [29]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In CVPR, pages 10106–10116, 2024

  30. [30]

    Attention-guided hierarchical structure aggregation for image matting

    Yu Qiao, Yuhao Liu, Xin Yang, Dongsheng Zhou, Mingliang Xu, Qiang Zhang, and Xiaopeng Wei. Attention-guided hierarchical structure aggregation for image matting. In CVPR, June 2020

  31. [31]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021

  32. [32]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. PAMI, 44(3):1623–1637, 2020

  33. [33]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022

  34. [34]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  35. [35]

    Stereopilot: Learning unified and efficient stereo conversion via generative priors

    Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, and Ying-Cong Chen. Stereopilot: Learning unified and efficient stereo conversion via generative priors. arXiv preprint arXiv:2512.16915, 2025

  36. [36]

    M2svid: End-to-end inpainting and refinement for monocular-to-stereo video conversion

    Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, and Federico Tombari. M2svid: End-to-end inpainting and refinement for monocular-to-stereo video conversion. arXiv preprint arXiv:2505.16565, 2025

  37. [37]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  38. [38]

    Stereodiffusion: Training-free stereo image generation using latent diffusion models

    Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, and Siavash Arjomand Bigdeli. Stereodiffusion: Training-free stereo image generation using latent diffusion models. In CVPR, pages 7416–7425, 2024

  39. [39]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In CVPR, pages 5261–5271, 2025

  40. [40]

    Moge-2: Accurate monocular geometry with metric scale and sharp details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. In NIPS, 2025

  41. [41]

    Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks

    Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In European conference on computer vision, pages 842–857. Springer, 2016

  42. [42]

    Pixel-perfect depth with semantics-prompted diffusion transformers

    Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, and Xin Yang. Pixel-perfect depth with semantics-prompted diffusion transformers. In NIPS, 2025

  43. [43]

    Deep image matting

    Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2970–2979, 2017

  44. [44]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, pages 10371–10381, 2024

  45. [45]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In NeurIPS, 2024

  46. [46]

    Matanyone 2: Scaling video matting via a learned quality evaluator

    Peiqing Yang, Shangchen Zhou, Kai Hao, and Qingyi Tao. Matanyone 2: Scaling video matting via a learned quality evaluator. In CVPR, 2026

  47. [47]

    Matanyone: Stable video matting with consistent memory propagation

    Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, and Chen Change Loy. Matanyone: Stable video matting with consistent memory propagation. In CVPR, pages 7299–7308, 2025

  48. [48]

    Vitmatte: Boosting image matting with pre-trained plain vision transformers

    Jingfeng Yao, Xinggang Wang, Shusheng Yang, and Baoyuan Wang. Vitmatte: Boosting image matting with pre-trained plain vision transformers. Information Fusion, 103:102091, 2024

  49. [49]

    Diversedepth: Affine-invariant depth prediction using diverse data

    Wei Yin, Xinlong Wang, Chunhua Shen, Yifan Liu, Zhi Tian, Songcen Xu, Changming Sun, and Dou Renyin. Diversedepth: Affine-invariant depth prediction using diverse data. arXiv preprint arXiv:2002.00569, 2020

  50. [50]

    Mono2stereo: A benchmark and empirical study for stereo conversion

    Songsong Yu, Yuxin Chen, Zhongang Qi, Zeke Xie, Yifan Wang, Lijun Wang, Ying Shan, and Huchuan Lu. Mono2stereo: A benchmark and empirical study for stereo conversion. In CVPR, pages 21847–21856, 2025

  51. [51]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024

  52. [52]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018

  53. [53]

    Betterdepth: Plug-and-play diffusion refiner for zero-shot monocular depth estimation

    Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, and Christopher Schroers. Betterdepth: Plug-and-play diffusion refiner for zero-shot monocular depth estimation. In NeurIPS, 2024

  54. [54]

    High-fidelity novel view synthesis via splatting-guided diffusion

    Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, and Christopher Schroers. High-fidelity novel view synthesis via splatting-guided diffusion. In SIGGRAPH, SIGGRAPH Conference Papers ’25, New York, NY, USA, 2025. Association for Computing Machinery

  55. [55]

    Guardians of the hair: Rescuing soft boundaries in depth, stereo, and novel views

    Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, and Christopher Schroers. Guardians of the hair: Rescuing soft boundaries in depth, stereo, and novel views. In CVPR, 2026

  56. [56]

    Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos

    Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, and Ying Shan. Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos. arXiv preprint arXiv:2409.07447, 2024

  57. [57]

    Stereo magnification: learning view synthesis using multiplane images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Trans. Graph., 37(4), July 2018