pith. machine review for the scientific record.

arxiv: 2604.11097 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular depth estimation · polarization imaging · diffusion models · cross-modal fusion · surface normal prediction · reflective surfaces · computer vision

The pith

A diffusion model fuses RGB images with polarization measurements through a learnable gate to improve depth estimates on reflective and transparent surfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a diffusion-based approach to single-image depth estimation that adds physically measured polarization signals to standard RGB input. It encodes both modalities into one latent space and uses a trainable gating unit to blend them so that noisy polarization data is down-weighted while useful cues around reflections and transparency are kept. Experiments on synthetic and real datasets show the fused model beats pure RGB baselines in difficult regions and stays comparable elsewhere. The same architecture extends to surface normal estimation with only small changes. The central idea is that polarization supplies independent physical constraints that RGB alone cannot provide in ambiguous scenes.

Core claim

CDPR encodes RGB and polarization (AoLP/DoLP) images into a shared latent space with a pre-trained VAE, then applies a learnable confidence-aware gating mechanism to produce an integrated representation that is denoised into a depth map; this selectively suppresses noisy polarization signals while retaining informative cues near reflective or transparent surfaces.

What carries the argument

The learnable confidence-aware gating mechanism that adaptively fuses RGB and polarization latent features before the diffusion denoising steps.
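
A minimal sketch of what such a gate could look like, assuming both modalities have already been encoded to latents of shape (B, C, H, W). The paper does not publish this code; the convolutional gate head and the convex blend below are illustrative assumptions, and the fused latent would then condition the denoising network in place of the RGB latent alone.

    import torch
    import torch.nn as nn

    class ConfidenceGate(nn.Module):
        """Hypothetical confidence-aware fusion of RGB and polarization latents."""

        def __init__(self, channels: int):
            super().__init__()
            # Small conv head mapping concatenated latents to a per-pixel,
            # per-channel confidence in [0, 1].
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.Sigmoid(),
            )

        def forward(self, z_rgb: torch.Tensor, z_pol: torch.Tensor) -> torch.Tensor:
            g = self.gate(torch.cat([z_rgb, z_pol], dim=1))
            # g -> 0 suppresses noisy polarization latents; g -> 1 keeps them.
            return (1.0 - g) * z_rgb + g * z_pol

    # Usage: fuse two 4-channel Stable-Diffusion-style latents.
    fused = ConfidenceGate(channels=4)(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))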

If this is right

  • Depth accuracy improves in textureless, transparent, and specular regions compared with RGB-only diffusion models.
  • Performance remains competitive with RGB-only models on ordinary scenes.
  • The same gated latent fusion works for surface normal prediction after minimal retraining.
  • The framework supports other polarization-guided dense prediction tasks without redesign of the core diffusion pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar gating could let polarization improve other single-image tasks such as segmentation or optical flow in adverse lighting.
  • If the polarization cues prove robust across cameras, the method could reduce reliance on active sensors like LiDAR for indoor or automotive depth.
  • The latent-space fusion pattern may generalize to other cross-modal pairs where one modality is noisier than the other.

Load-bearing premise

Polarization measurements supply stable physical cues in hard regions that a gating network can selectively use without adding new errors or needing extra calibration.

What would settle it

A controlled test on real scenes with ground-truth depth in highly specular or transparent areas; the central claim fails if the gated model produces larger errors there than an RGB-only diffusion baseline.
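
A minimal sketch of the region-restricted comparison that test implies, assuming ground-truth depth and a boolean mask marking specular or transparent pixels are available; the AbsRel and RMSE definitions are standard, the data handling is an assumption.

    import numpy as np

    def masked_depth_errors(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray):
        """AbsRel and RMSE restricted to masked (e.g. specular) pixels; mask is boolean."""
        valid = mask & (gt > 0)               # ignore pixels without ground truth
        p, g = pred[valid], gt[valid]
        abs_rel = float(np.mean(np.abs(p - g) / g))
        rmse = float(np.sqrt(np.mean((p - g) ** 2)))
        return abs_rel, rmse

    # Run once for the gated model and once for the RGB-only baseline;
    # the claim fails if the gated model's errors are larger on the mask.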

Figures

Figures reproduced from arXiv: 2604.11097 by Cuiwei Liu, Hao Wang, Rongjia Yu, Tong Jia, Xiaofang Li, Xiao Yang, Zinuo Zhang.

Figure 1: RGB-only methods (e.g., Marigold [19]) suffer from scale ambiguity in low-texture regions, such as the table surface, where the predicted distance to the background wall is significantly overestimated. While this error may have limited impact on perceptual quality and even on standard metrics, it is substantially amplified in downstream tasks like 3D reconstruction, leading to distorted geometry and missing surfaces.

Figure 2: Overview of our training pipeline. The input consists of an RGB image and its corresponding polarization observations.

Figure 3: Network architecture of the proposed Confidence…

Figure 4: Qualitative comparison of depth estimation results on the HyperPol dataset. Compared to prior methods, our method…

Figure 5: Qualitative comparison of depth estimation results on the HAMMER dataset. Compared to the more complex HyperPol…

Figure 6: Qualitative comparison of depth estimation results on in-the-wild examples. As highlighted in the red boxes, RGB-only…

Figure 7: Qualitative results of our extended experiments on surface normal estimation. The results show that our method remains…

Figure 8: Comparison of RGB input, polarization cues, and…
Original abstract

Monocular depth estimation is a fundamental yet challenging task in computer vision, especially under complex conditions such as textureless surfaces, transparency, and specular reflections. Recent diffusion-based approaches have significantly advanced performance by reformulating depth prediction as a denoising process in the latent space. However, existing methods rely solely on RGB inputs, which often lack sufficient cues in challenging regions. In this work, we present CDPR - Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation - a novel diffusion-based framework that integrates physically grounded polarization priors to enhance estimation robustness. Specifically, we encode both RGB and polarization (AoLP/DoLP) images into a shared latent space via a pre-trained Variational Autoencoder (VAE), and dynamically fuse multi-modal information through a learnable confidence-aware gating mechanism. This fusion module adaptively suppresses noisy signals in polarization inputs while preserving informative cues, particularly around reflective or transparent surfaces, and provides the integrated latent representation for subsequent monocular depth estimation. Beyond depth estimation, we further verify that our framework can be easily generalized to surface normal prediction with minimal modification, showcasing its scalability to general polarization-guided dense prediction tasks. Experiments on both synthetic and real-world datasets validate that CDPR significantly outperforms RGB-only baselines in challenging regions while maintaining competitive performance in standard scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CDPR, a cross-modal diffusion framework for monocular depth estimation that encodes RGB and polarization (AoLP/DoLP) images into a shared latent space using a pre-trained VAE, fuses them via a learnable confidence-aware gating mechanism to suppress noise while preserving cues in challenging regions (textureless, specular, transparent surfaces), and performs depth prediction as a denoising process. It claims significant outperformance over RGB-only baselines on synthetic and real datasets in difficult areas while remaining competitive elsewhere, with easy generalization to surface normal prediction.

Significance. If the quantitative results and ablations hold, the work offers a concrete way to inject physically grounded polarization priors into latent diffusion pipelines for more reliable dense prediction, which could benefit downstream tasks in robotics and AR/VR where RGB cues fail. The gating fusion and cross-task generalization are potentially reusable ideas.

major comments (2)
  1. [Abstract, §3 (method)] The central construction encodes AoLP/DoLP via a pre-trained RGB VAE into the same latent space as RGB before gating and diffusion. Because AoLP is periodic (0–π) and DoLP is bounded in [0, 1], with statistics far from those of natural images, the resulting latents are likely distorted; the paper must demonstrate (via latent-space visualizations, reconstruction error, or an ablation replacing the VAE) that physical polarization priors survive this mapping rather than being irrecoverably lost before the gating stage. (A preprocessing sketch addressing the range mismatch follows these comments.)
  2. [Abstract, §4 (experiments)] The claim of 'significant outperformance' in challenging regions is asserted without any reported metrics, error breakdowns, or ablation tables in the provided abstract. The full experiments section must include quantitative comparisons (e.g., AbsRel and RMSE on textureless/specular subsets), an ablation of the gating module, and an analysis of polarization noise and calibration effects on real data to substantiate the load-bearing robustness claim.

minor comments (2)
  1. [Abstract] The phrase 'provides the integrated latent representation for subsequent monocular depth estimation' is redundant with the preceding sentence and could be tightened.
  2. [§3] Notation: AoLP/DoLP are introduced without explicit definition of their ranges or preprocessing steps before VAE encoding; a short paragraph or figure in §3 would improve clarity.
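
One plausible way to meet the range concern in major comment 1 and the preprocessing request in minor comment 2, sketched below. The abstract does not specify the paper's preprocessing, so the sin/cos encoding of AoLP and the channel layout here are assumptions, not the authors' method.

    # A hedged sketch, not the paper's method: make AoLP/DoLP statistics
    # VAE-friendly before encoding. AoLP is periodic on [0, pi), so it is
    # encoded as (cos 2θ, sin 2θ), which is continuous across the wrap-around;
    # DoLP already lies in [0, 1]. Both are rescaled to the [-1, 1] range that
    # Stable-Diffusion-style VAEs expect.
    import torch

    def polarization_to_vae_input(aolp: torch.Tensor, dolp: torch.Tensor) -> torch.Tensor:
        """aolp in radians [0, pi), dolp in [0, 1]; both shaped (B, 1, H, W)."""
        c = torch.cos(2.0 * aolp)           # already in [-1, 1]
        s = torch.sin(2.0 * aolp)
        d = dolp * 2.0 - 1.0                # [0, 1] -> [-1, 1]
        return torch.cat([c, s, d], dim=1)  # three image-like channels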

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract, §3 (method)] The central construction encodes AoLP/DoLP via a pre-trained RGB VAE into the same latent space as RGB before gating and diffusion. Because AoLP is periodic (0–π) and DoLP is bounded in [0, 1], with statistics far from those of natural images, the resulting latents are likely distorted; the paper must demonstrate (via latent-space visualizations, reconstruction error, or an ablation replacing the VAE) that physical polarization priors survive this mapping rather than being irrecoverably lost before the gating stage.

    Authors: We recognize the referee's concern about potential distortion in the latent space due to the differing statistical properties of polarization images compared to natural RGB images. Our approach relies on the pre-trained VAE to project both modalities into a shared space, allowing the gating mechanism to integrate them effectively. To validate that polarization priors are not lost, we will incorporate latent-space visualizations (e.g., t-SNE or PCA projections of latents from RGB and polarization), report reconstruction errors for DoLP/AoLP inputs, and conduct an ablation where we replace the shared VAE with a polarization-specific encoder. This will demonstrate the survival of physical cues and justify our design choice. (A round-trip sketch of this check follows these responses.) revision: yes

  2. Referee: [Abstract, §4 (experiments)] The claim of 'significant outperformance' in challenging regions is asserted without any reported metrics, error breakdowns, or ablation tables in the provided abstract. The full experiments section must include quantitative comparisons (e.g., AbsRel and RMSE on textureless/specular subsets), an ablation of the gating module, and an analysis of polarization noise and calibration effects on real data to substantiate the load-bearing robustness claim.

    Authors: The abstract provides a high-level summary, while the detailed quantitative results are presented in Section 4 of the manuscript. To more thoroughly substantiate the claims of outperformance in challenging regions, we will revise the experiments section to include: (1) quantitative metrics (AbsRel, RMSE, etc.) broken down by scene types such as textureless, specular, and transparent surfaces; (2) a comprehensive ablation study isolating the contribution of the confidence-aware gating module; and (3) an analysis of the impact of polarization noise and calibration variations on real-world data performance. These additions will provide stronger evidence for the robustness benefits. revision: yes
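
As flagged in response 1 above, a minimal sketch of the proposed round-trip check: pass a polarization-derived image through a pre-trained RGB VAE and measure how much signal survives. The checkpoint name is an assumption; any Stable-Diffusion-compatible VAE from the diffusers library would serve.

    import torch
    from diffusers import AutoencoderKL

    # Assumed checkpoint; the paper does not name the VAE it builds on.
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

    @torch.no_grad()
    def vae_roundtrip_mse(x: torch.Tensor) -> float:
        """x: (B, 3, H, W) in [-1, 1], e.g. stacked AoLP/DoLP channels."""
        z = vae.encode(x).latent_dist.mean   # deterministic latent code
        recon = vae.decode(z).sample
        return torch.mean((recon - x) ** 2).item()

    # A large gap between RGB and polarization round-trip error would
    # indicate the priors are being distorted before the gating stage.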

Circularity Check

0 steps flagged

No circularity: framework uses external pre-trained VAE and independent dataset validation

full rationale

The abstract and described method encode RGB and polarization inputs via a pre-trained VAE (external to the current work), apply a learnable gating module, and feed the result into a diffusion denoiser for depth prediction. No equation or step reduces by construction to a fitted parameter renamed as prediction, nor does any load-bearing claim rest on a self-citation chain that itself assumes the target result. Validation occurs on separate synthetic and real-world datasets, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that polarization provides useful priors for depth in specific failure cases of RGB, plus the technical assumption that VAE encoding and gating can fuse modalities effectively.

axioms (2)
  • domain assumption Polarization images supply physically grounded additional cues for depth estimation around reflective and transparent surfaces.
    Invoked as the motivation for the cross-modal approach in the abstract.
  • domain assumption A pre-trained VAE can encode both RGB and polarization into a shared latent space suitable for diffusion-based prediction.
    Stated as the encoding step without further justification in the abstract.

pith-pipeline@v0.9.0 · 5543 in / 1307 out tokens · 26699 ms · 2026-05-10T15:40:42.493233+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1] S.-J. Park, K.-S. Hong, and S. Lee, "RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4980–4989.
  2. [2] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," The International Journal of Robotics Research, vol. 34, no. 4–5, pp. 705–724, 2015.
  3. [3] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in CVPR 2011. IEEE, 2011, pp. 1297–1304.
  4. [4] W. Yin, C. Zhang, H. Chen, Z. Cai, G. Yu, K. Wang, X. Chen, and C. Shen, "Metric3D: Towards zero-shot metric 3D prediction from a single image," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9043–9053.
  5. [5] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, "Depth Anything: Unleashing the power of large-scale unlabeled data," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10371–10381.
  6. [6] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, "Depth Anything V2," Advances in Neural Information Processing Systems, vol. 37, pp. 21875–21911, 2024.
  7. [7] R. Wang, S. Xu, C. Dai, J. Xiang, Y. Deng, X. Tong, and J. Yang, "MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 5261–5271.
  8. [8] R. Wang, S. Xu, Y. Dong, Y. Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang, "MoGe-2: Accurate monocular geometry with metric scale and sharp details," arXiv preprint arXiv:2507.02546, 2025.
  9. [9] C. Ling, X. Zhang, and H. Chen, "Unsupervised monocular depth estimation using attention and multi-warp reconstruction," IEEE Transactions on Multimedia, vol. 24, pp. 2938–2949, 2021.
  10. [10] R. Peng, R. Wang, Y. Lai, L. Tang, and Y. Cai, "Excavating the potential capacity of self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15560–15569.
  11. [11] R. Li, D. Xue, Y. Zhu, H. Wu, J. Sun, and Y. Zhang, "Self-supervised monocular depth estimation with frequency-based recurrent refinement," IEEE Transactions on Multimedia, vol. 25, pp. 5626–5637, 2022.
  12. [12] L. Sun, J.-W. Bian, H. Zhan, W. Yin, I. Reid, and C. Shen, "SC-DepthV3: Robust self-supervised monocular depth estimation for dynamic scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 497–508, 2023.
  13. [13] S. Yu, M. Wu, and S.-K. Lam, "VFM-Depth: Leveraging vision foundation model for self-supervised monocular depth estimation," IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  14. [14] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
  15. [15] R. Mahjourian, M. Wicke, and A. Angelova, "Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5667–5675.
  16. [16] S. Shao, Z. Pei, W. Chen, D. Sun, P. C. Chen, and Z. Li, "MonoDiffusion: Self-supervised monocular depth estimation using diffusion model," IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  17. [17] A. CS Kumar, S. M. Bhandarkar, and M. Prasad, "Monocular depth prediction using generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 300–308.
  18. [18] F. Aleotti, F. Tosi, M. Poggi, and S. Mattoccia, "Generative adversarial networks for unsupervised monocular depth prediction," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
  19. [19] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, "Repurposing diffusion-based image generators for monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9492–9502.
  20. [20] X. Fu, W. Yin, M. Hu, K. Wang, Y. Ma, P. Tan, S. Shen, D. Lin, and X. Long, "GeoWizard: Unleashing the diffusion priors for 3D geometry estimation from a single image," in European Conference on Computer Vision. Springer, 2024, pp. 241–258.
  21. [21] G. M. Garcia, K. Abou Zeid, C. Schmidt, D. de Geus, A. Hermans, and B. Leibe, "Fine-tuning image-conditional diffusion models is easier than you think," in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 753–762.
  22. [22] M. Gui, J. Schusterbauer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. A. Baumann, V. T. Hu, and B. Ommer, "DepthFM: Fast generative monocular depth estimation with flow matching," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 3, 2025, pp. 3203–3211.
  23. [23] J. He, H. Li, W. Yin, Y. Liang, L. Li, K. Zhou, H. Zhang, B. Liu, and Y.-C. Chen, "Lotus: Diffusion-based visual foundation model for high-quality dense prediction," arXiv preprint arXiv:2409.18124, 2024.
  24. [24] Y. Ba, A. Gilbert, F. Wang, J. Yang, R. Chen, Y. Wang, L. Yan, B. Shi, and A. Kadambi, "Deep shape from polarization," in European Conference on Computer Vision. Springer, 2020, pp. 554–571.
  25. [25] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, "Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10912–10922.
  26. [26] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, vol. 27, 2014.
  27. [27] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12179–12188.
  28. [28] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1623–1637, 2020.
  29. [29] M. Song, S. Lim, and W. Kim, "Monocular depth estimation using Laplacian pyramid-based depth residuals," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 11, pp. 4381–4393, 2021.
  30. [30] G. Xu, Y. Ge, M. Liu, C. Fan, K. Xie, Z. Zhao, H. Chen, and C. Shen, "What matters when repurposing diffusion models for general dense perception tasks?" arXiv preprint arXiv:2403.06090, 2024.
  31. [31] A. Eftekhar, A. Sax, J. Malik, and A. Zamir, "Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3D scans," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10786–10796.
  32. [32] E. Collett, Field Guide to Polarization. SPIE Press, Bellingham, 2005, vol. 15.
  33. [33] Miyazaki, Tan, Hara, and Ikeuchi, "Polarization-based inverse rendering from a single view," in Proceedings Ninth IEEE International Conference on Computer Vision. IEEE, 2003, pp. 982–987.
  34. [34] G. A. Atkinson, "Polarisation photometric stereo," Computer Vision and Image Understanding, vol. 160, pp. 158–167, 2017.
  35. [35] S. Rahmann and N. Canterakis, "Reconstruction of specular surfaces using polarization imaging," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. I–I.
  36. [36] S.-H. Baek, D. S. Jeon, X. Tong, and M. H. Kim, "Simultaneous acquisition of polarimetric SVBRDF and normals," ACM Transactions on Graphics, vol. 37, no. 6, p. 268, 2018.
  37. [37] A. Dave, Y. Zhao, and A. Veeraraghavan, "PANDORA: Polarization-aided neural decomposition of radiance," in European Conference on Computer Vision. Springer, 2022, pp. 538–556.
  38. [38] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  39. [39] Y. Han, H. Guo, K. Fukai, H. Santo, B. Shi, F. Okura, Z. Ma, and Y. Jia, "NeRSP: Neural 3D reconstruction for reflective objects with sparse polarized images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11821–11830.
  40. [40] X. Cao, H. Santo, F. Okura, and Y. Matsushita, "Multi-view azimuth stereo via tangent space consistency," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 825–834.
  41. [41] M. Shao, C. Xia, D. Duan, and X. Wang, "Polarimetric inverse rendering for transparent shapes reconstruction," IEEE Transactions on Multimedia, vol. 26, pp. 7801–7811, 2024.
  42. [42] M. Shao, C. Xia, Z. Yang, J. Huang, and X. Wang, "Transparent shape from a single view polarization image," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9277–9286.
  43. [43] C. Li, T. Ono, T. Uemori, H. Mihara, A. Gatto, H. Nagahara, and Y. Moriuchi, "NeISF: Neural incident Stokes field for geometry and material estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21434–21445.
  44. [44] D. Zhu and W. A. Smith, "Depth from a polarisation + RGB stereo pair," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7586–7595.
  45. [45] M. Blanchon, D. Sidibé, O. Morel, R. Seulin, D. Braun, and F. Meriaudeau, "P2D: A self-supervised method for depth estimation from polarimetry," in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 7357–7364.
  46. [46] C. Lei, C. Qi, J. Xie, N. Fan, V. Koltun, and Q. Chen, "Shape from polarization for complex scenes in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12632–12641.
  47. [47] C. Tian, W. Pan, Z. Wang, M. Mao, G. Zhang, H. Bao, P. Tan, and Z. Cui, "DPS-Net: Deep polarimetric stereo depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3569–3579.
  48. [48] K. Ikemura, Y. Huang, F. Heide, Z. Zhang, Q. Chen, and C. Lei, "Robust depth enhancement via polarization prompt fusion tuning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20710–20720.
  49. [49] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
  50. [50] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., "LAION-5B: An open large-scale dataset for training next generation image-text models," Advances in Neural Information Processing Systems, vol. 35, pp. 25278–25294, 2022.
  51. [51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
  52. [52] H. Jung, P. Ruhkamp, G. Zhai, N. Brasch, Y. Li, Y. Verdie, J. Song, Y. Zhou, A. Armagan, S. Ilic et al., "Is my depth ground-truth good enough? HAMMER: Highly accurate multi-modal dataset for dense 3D scene regression," arXiv preprint arXiv:2205.04565, 2022.
  53. [53] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
  54. [54] B. Ke, K. Qu, T. Wang, N. Metzger, S. Huang, B. Li, A. Obukhov, and K. Schindler, "Marigold: Affordable adaptation of diffusion-based image generators for image analysis," arXiv preprint arXiv:2505.09358, 2025.
  55. [55] H. Jung, S.-C. Wu, P. Ruhkamp, G. Zhai, H. Schieber, G. Rizzoli, P. Wang, H. Zhao, L. Garattoni, S. Meier et al., "HouseCat6D: A large-scale multi-modal category level 6D object perception dataset with household objects in realistic scenarios," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22498–22508.
  56. [56] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  57. [57] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
  58. [58] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.