SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping
Pith reviewed 2026-05-21 19:29 UTC · model grok-4.3
The pith
Adapting the plane sweep algorithm for sonar-vision fusion in deep learning produces dense accurate underwater depth maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SonarSweep adapts the principled plane sweep algorithm within an end-to-end deep learning framework to perform cross-modal fusion between sonar and visual data, enabling the generation of dense and accurate depth maps that outperform state-of-the-art methods in underwater environments, particularly under high turbidity.
What carries the argument
The adapted plane sweep algorithm for cross-modal fusion, which sweeps through depth planes and matches features from both sonar and vision to resolve ambiguities without extra heuristics.
If this is right
- Dense depth maps become available in high turbidity where vision-based methods fail.
- Sonar's elevation ambiguity is resolved through integration with visual data.
- Complex scenes can be modeled without the artifacts produced by prior heuristic fusion techniques.
- Robust performance holds across both high-fidelity simulation and real-world tests.
Where Pith is reading between the lines
- The same geometric fusion idea could be tested for other sensor pairs in low-visibility settings such as dust or fog.
- Real-time versions might improve autonomous underwater vehicle navigation and obstacle avoidance.
- The released synchronized dataset allows other groups to train and compare alternative fusion networks directly.
Load-bearing premise
The principled plane sweep algorithm can be directly adapted for effective cross-modal fusion between sonar and visual data without introducing new geometric artifacts or requiring additional heuristics in complex scenes.
What would settle it
Running the released real-world dataset in high-turbidity conditions and finding that SonarSweep depth maps contain more artifacts or lower accuracy than heuristic-based fusion methods would disprove the central claim.
Figures
read the original abstract
Accurate 3D reconstruction in visually-degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion technique relies on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SonarSweep, an end-to-end deep learning framework that adapts the classical plane-sweep algorithm for cross-modal fusion of sonar range-bearing measurements and visual imagery. It claims this yields dense, accurate depth maps in underwater settings where single-modality approaches fail, with consistent outperformance over prior fusion methods especially under high turbidity; experiments are reported in high-fidelity simulation and real-world environments, and the authors commit to releasing code plus a new synchronized stereo-sonar dataset.
Significance. A rigorously derived plane-sweep fusion that correctly maps conical sonar returns onto swept planes while preserving geometric consistency would address a long-standing gap in underwater perception and could enable more reliable dense reconstruction for AUV navigation and mapping. The promised public release of code and the first synchronized stereo-sonar dataset would add substantial community value if the core geometric adaptation is shown to be artifact-free.
major comments (2)
- [§3.2] §3.2 (Cost-volume construction): the manuscript must supply an explicit geometric derivation showing how a sonar range-bearing observation intersects a given fronto-parallel plane, how elevation ambiguity is resolved or marginalized, and how the resulting cost is aggregated with visual features. Without this derivation the adaptation cannot be called 'principled' and risks re-introducing the heuristic assumptions the abstract criticizes in prior work.
- [§4] §4 (Experiments): quantitative results are presented without error bars, ablation studies isolating the sonar-to-plane projection, or clear dataset statistics (number of frames, turbidity levels, ground-truth acquisition method). These omissions prevent verification that the reported outperformance is attributable to the claimed geometric fusion rather than implementation details or dataset bias.
minor comments (2)
- [Figure 3] Figure 3 caption and §3.1: the notation for the sonar projection matrix is introduced without a preceding equation; add an explicit definition before its first use.
- [Related Work] Related-work section: several recent sonar-vision fusion papers (e.g., 2023–2024) are cited only by name; add one-sentence summaries of their geometric assumptions to clarify the precise novelty of SonarSweep.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The suggestions to strengthen the geometric derivation and experimental reporting are valuable, and we have revised the manuscript to address them directly while preserving the core contributions of SonarSweep.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Cost-volume construction): the manuscript must supply an explicit geometric derivation showing how a sonar range-bearing observation intersects a given fronto-parallel plane, how elevation ambiguity is resolved or marginalized, and how the resulting cost is aggregated with visual features. Without this derivation the adaptation cannot be called 'principled' and risks re-introducing the heuristic assumptions the abstract criticizes in prior work.
Authors: We appreciate the referee's call for an explicit geometric derivation. The original §3.2 describes the sonar-to-plane mapping and cost-volume construction at a high level, but we agree that a self-contained derivation is needed to fully substantiate the claim of a principled adaptation. In the revised manuscript we have inserted a new subsection that derives: (1) the intersection geometry of a conical sonar range-bearing measurement with a fronto-parallel plane at hypothesized depth d, (2) marginalization of elevation ambiguity via integration over the elevation angle consistent with the plane intersection (or learned probabilistic weighting within the network), and (3) the subsequent aggregation of the resulting sonar cost with visual feature similarity scores. This derivation is free of the heuristic assumptions criticized in prior fusion work and directly supports the end-to-end training. revision: yes
-
Referee: [§4] §4 (Experiments): quantitative results are presented without error bars, ablation studies isolating the sonar-to-plane projection, or clear dataset statistics (number of frames, turbidity levels, ground-truth acquisition method). These omissions prevent verification that the reported outperformance is attributable to the claimed geometric fusion rather than implementation details or dataset bias.
Authors: We agree that the experimental section would benefit from these additions to allow independent verification. In the revised §4 we now report error bars (standard deviation across repeated training runs and cross-validation folds) on all quantitative metrics, include dedicated ablation studies that isolate the sonar-to-plane projection component by comparing full SonarSweep against variants that replace the geometric mapping with simpler heuristics, and provide expanded dataset statistics: exact frame counts, simulated and measured turbidity levels (with corresponding attenuation coefficients), and ground-truth acquisition details (simulation ray-tracing for synthetic data; synchronized stereo reconstruction in a controlled tank for real data). These revisions confirm that the observed gains are attributable to the geometric fusion. revision: yes
Circularity Check
No significant circularity in claimed adaptation of plane sweep
full rationale
The paper presents SonarSweep as a novel end-to-end framework adapting the established plane-sweep algorithm for cross-modal sonar-vision fusion. The abstract and provided description introduce the method without any equations, fitted parameters, or self-citations that reduce the output depth maps or fusion process to the inputs by construction. The central claim of principled adaptation is supported by claimed experimental validation in simulation and real environments rather than by definitional equivalence or load-bearing self-reference. This is the most common honest finding for papers that build on standard algorithms like plane sweeping without internal reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Mapping 3d underwater environments with smoothed submaps,
M. VanMiddlesworth, M. Kaess, F. Hover, and J. J. Leonard, “Mapping 3d underwater environments with smoothed submaps,” inField and Service Robotics: Results of the 9th International Conference, Springer, 2015, pp. 17–30
work page 2015
-
[2]
Inspection and maintenance of industrial infrastructure with autonomous underwater robots,
F. Nauert and P. Kampmann, “Inspection and maintenance of industrial infrastructure with autonomous underwater robots,”Frontiers in Robotics and AI, vol. 10, p. 1 240 276, 2023
work page 2023
-
[3]
Limi- tations of vision guided underwater navigation,
M. E. Angelopoulou, C. Tsiotsios, and M. Petrou, “Limi- tations of vision guided underwater navigation,”IFAC Pro- ceedings Volumes, vol. 45, no. 5, pp. 312–317, 2012
work page 2012
-
[4]
Turbid-water subsea infrastructure 3d recon- struction with assisted stereo,
R. Detry et al., “Turbid-water subsea infrastructure 3d recon- struction with assisted stereo,” in2018 OCEANS-MTS/IEEE Kobe Techno-Oceans (OTO), IEEE, 2018, pp. 1–6
work page 2018
- [5]
-
[6]
J.-Y . Park, H. Baek, B.-H. Jun, and P.-M. Lee, “3d recon- struction using multiple acoustic images under roll motion based on backprojection techniques,” inOCEANS 2023 - MTS/IEEE U.S. Gulf Coast, 2023, pp. 1–4.DOI:10 . 23919/OCEANS52994.2023.10337360
-
[7]
Fusing concur- rent orthogonal wide-aperture sonar images for dense under- water 3d reconstruction,
J. McConnell, J. D. Martin, and B. Englot, “Fusing concur- rent orthogonal wide-aperture sonar images for dense under- water 3d reconstruction,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 1653–1660.DOI:10 . 1109 / IROS45743 . 2020 . 9340995
work page 2020
-
[8]
Map building fusing acoustic and visual information using autonomous underwater vehicles,
C. Kunz and H. Singh, “Map building fusing acoustic and visual information using autonomous underwater vehicles,” Journal of field robotics, vol. 30, no. 5, pp. 763–783, 2013
work page 2013
-
[9]
Underwater monocular image depth estimation using single-beam echosounder,
M. Roznere and A. Q. Li, “Underwater monocular image depth estimation using single-beam echosounder,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 1785–1790
work page 2020
- [10]
-
[11]
I. Collado-Gonzalez, J. McConnell, P. Szenher, and B. Englot,Opti-acoustic scene reconstruction in highly tur- bid underwater environments, 2025. arXiv:2508.03408 [cs.RO]. [Online]. Available:https://arxiv.org/ abs/2508.03408
-
[12]
DPSNet: End-to-end Deep Plane Sweep Stereo
S. Im, H.-G. Jeon, S. Lin, and I. S. Kweon,Dpsnet: End-to- end deep plane sweep stereo, 2019. arXiv:1905.00538 [cs.CV]. [Online]. Available:https://arxiv.org/ abs/1905.00538
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[13]
Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan,Mvsnet: Depth inference for unstructured multi-view stereo, 2018. arXiv: 1804.02505 [cs.CV]. [Online]. Available:https:// arxiv.org/abs/1804.02505
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[14]
N. Hurtos, D. Ribas, X. Cufi, Y . Petillot, and J. Salvi, “Fourier-based registration for robust forward-looking sonar mosaicing in low-visibility underwater environments,”Jour- nal of Field Robotics, vol. 32, no. 1, pp. 123–151, 2015
work page 2015
-
[15]
B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield,Foundationstereo: Zero-shot stereo matching,
-
[16]
Foundationstereo: Zero-shot stereo matching
arXiv:2501.09898 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2501.09898
-
[17]
Agisoft LLC,Agisoft metashape professional,https:// www.agisoft.com/, Version 2.0.2, 2023
work page 2023
-
[18]
Flsea: Underwater visual-inertial and stereo- vision forward-looking datasets,
Y . Randall, “Flsea: Underwater visual-inertial and stereo- vision forward-looking datasets,” M.S. thesis, University of Haifa (Israel), 2023
work page 2023
-
[19]
Oceansim: A gpu-accelerated underwa- ter robot perception simulation framework,
J. Song, H. Ma, O. Bagoren, A. V . Sethuraman, Y . Zhang, and K. A. Skinner, “Oceansim: A gpu-accelerated underwa- ter robot perception simulation framework,”arXiv preprint arXiv:2503.01074, 2025
-
[20]
Differentiable space carving for 3d reconstruction using imaging sonar,
Y . Feng, W. Lu, H. Gao, B. Nie, K. Lin, and L. Hu, “Differentiable space carving for 3d reconstruction using imaging sonar,”IEEE Robotics and Automation Letters, vol. 9, no. 11, pp. 10 065–10 072, 2024.DOI:10.1109/ LRA.2024.3469778
-
[21]
Practical blind image denoising via swin- conv-unet and data synthesis,
K. Zhang et al., “Practical blind image denoising via swin- conv-unet and data synthesis,”Machine Intelligence Re- search, vol. 20, no. 6, pp. 822–836, Sep. 2023,ISSN: 2731- 5398.DOI:10.1007/s11633- 023- 1466- 0[Online]. Available:http://dx.doi.org/10.1007/s11633- 023-1466-0
-
[22]
An image synthesis method generating underwater images,
J. R. Ahamed, P. E. Abas, and L. C. De Silva, “An image synthesis method generating underwater images,”Advances in Technology Innovation, vol. 7, no. 3, p. 195, 2022
work page 2022
-
[23]
N. G. Jerlov and F. F. Koczy,Photographic measurements of daylight in deep water. Elanders boktr., 1951
work page 1951
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.