Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video
Pith reviewed 2026-05-16 23:11 UTC · model grok-4.3
The pith
Passive scene sounds complement vision for relative camera pose estimation in real-world videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Passive scene sounds provide cues complementary to vision for relative camera pose estimation in in-the-wild videos. A simple audio-visual framework integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model, yielding consistent gains over strong visual baselines on two large datasets and robustness when the visual input is corrupted.
What carries the argument
Direction-of-arrival spectra and binauralized embeddings extracted from passive scene audio, integrated into vision-only pose models.
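To make the notion of a direction-of-arrival spectrum concrete, here is a minimal sketch; it is not the paper's pipeline. It assumes a two-microphone array and SRP-PHAT restricted to 100–8000 Hz (the algorithm and band named in the simulated rebuttal further down this page). The microphone spacing, sample rate, speed of sound, naive DFT helper, synthetic signal and all function names are illustrative assumptions.

```python
import cmath
import math
import random

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def rfft_naive(x):
    """Naive O(N^2) DFT over non-negative frequencies; adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def srp_phat_spectrum(x1, x2, fs, mic_distance, angles_deg, f_lo=100.0, f_hi=8000.0):
    """Steered response power with PHAT weighting for one microphone pair.

    Returns one normalized power value per candidate angle. Positive angles
    mean the sound reaches microphone 1 before microphone 2.
    """
    n = len(x1)
    X1, X2 = rfft_naive(x1), rfft_naive(x2)
    bins = []
    for k in range(len(X1)):
        f = k * fs / n
        if f_lo <= f <= f_hi:
            g = X1[k] * X2[k].conjugate()            # cross-power spectrum
            bins.append((f, g / (abs(g) + 1e-12)))   # PHAT: keep phase only
    spectrum = []
    for a in angles_deg:
        # Far-field delay of mic 2 relative to mic 1 for this candidate angle.
        tau = mic_distance * math.sin(math.radians(a)) / SPEED_OF_SOUND
        power = sum((g * cmath.exp(-2j * math.pi * f * tau)).real for f, g in bins)
        spectrum.append(power / len(bins))
    return spectrum

# Synthetic check: broadband noise reaching mic 2 four samples after mic 1.
rng = random.Random(0)
fs, n, delay = 16000, 256, 4
source = [rng.gauss(0.0, 1.0) for _ in range(n)]
mic1 = source
mic2 = [source[(t - delay) % n] for t in range(n)]   # circular delay keeps the DFT exact
angles = list(range(-90, 91))
spec = srp_phat_spectrum(mic1, mic2, fs, 0.2, angles)
peak_angle = angles[spec.index(max(spec))]
print(peak_angle)
```

With a 0.2 m spacing, a four-sample delay at 16 kHz corresponds to an arrival angle of roughly 25 degrees, so the peak of the spectrum should fall near that value. Real recordings would need framing, windowing and handling of reverberation, none of which this sketch attempts.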
Load-bearing premise
That direction-of-arrival spectra and binauralized embeddings from passive scene sounds can be integrated with visual features to produce consistent and measurable improvements in pose estimation accuracy.
What would settle it
Running the proposed integration on the same datasets and observing no accuracy gains or even losses compared to the vision-only baseline would falsify the central claim.
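One hedged way to operationalize "no accuracy gains" is a paired test on per-sample pose errors from the two models. The simulated rebuttal further down mentions the Wilcoxon signed-rank test; the sketch below implements it in plain Python under stated assumptions: a normal approximation without tie or continuity correction, and error values that are synthetic placeholders, not results from the paper.

```python
import math

def wilcoxon_signed_rank(baseline_err, audio_visual_err):
    """Paired Wilcoxon signed-rank test (normal approximation, no tie correction).

    Returns (W+, z, two-sided p). W+ sums the ranks of pairs in which the
    baseline error exceeds the audio-visual error.
    """
    diffs = [b - a for b, a in zip(baseline_err, audio_visual_err) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Group tied absolute differences and give them their average rank.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, z, p_two_sided

# Synthetic rotation errors in degrees (placeholders for illustration only).
vision_only = [5.0, 7.5, 3.2, 9.1, 6.4, 4.8, 8.3, 5.9, 7.0, 6.1]
audio_visual = [4.6, 6.9, 3.3, 7.8, 6.0, 4.1, 7.5, 5.2, 6.8, 5.0]
w_plus, z, p = wilcoxon_signed_rank(vision_only, audio_visual)
print(w_plus, z, p)
```

A large p-value on real per-sample errors would be consistent with the falsifying outcome described above, while a small one would only indicate a systematic difference, not its cause. For small samples an exact test from a statistics library would be preferable to the normal approximation used here.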
Abstract
Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that passive scene sounds supply complementary cues to vision for relative camera pose estimation on in-the-wild videos. It introduces a framework that augments a state-of-the-art vision-only model with direction-of-arrival (DOA) spectra and binauralized embeddings extracted from incidental audio, reporting consistent gains over visual baselines on two large datasets together with improved robustness when visual input is corrupted by blur or occlusion. The work positions itself as the first to successfully leverage everyday audio for this spatial task.
Significance. If the quantitative claims hold under scrutiny, the result would be significant for embodied perception and multi-modal 3D understanding: it demonstrates that readily available passive audio can mitigate well-known failure modes of vision-only pose estimators without requiring additional sensors or active sound sources. The approach could influence downstream applications in robotics and AR where visual degradation is common.
major comments (3)
- [§3.2] §3.2 (Audio Feature Extraction): the description of DOA spectrum computation omits the specific algorithm (e.g., SRP-PHAT, MUSIC), frequency range, and handling of non-stationary or reverberant sources; because the central claim rests on these spectra providing reliable complementary spatial cues in uncontrolled environments, this omission is load-bearing and prevents assessment of whether reported gains arise from genuine audio geometry or dataset artifacts.
- [§4.2] §4.2 and Table 2: the reported gains over vision-only baselines are presented without ablation studies that isolate the contribution of DOA spectra versus binaural embeddings, nor with statistical significance tests or cross-dataset variance; without these controls it is impossible to confirm that the improvements are attributable to the audio integration rather than implementation details or dataset biases.
- [§4.3] §4.3 (Robustness Experiments): the visual corruption protocols (motion blur, occlusion) are not fully specified with parameters such as kernel size or occlusion ratio, making it difficult to reproduce the robustness results or determine whether the audio features genuinely compensate for the exact degradation levels claimed.
minor comments (2)
- [Figure 2] Figure 2: the diagram of the audio-visual fusion module would benefit from explicit notation for how DOA spectra are concatenated or attended with visual features.
- [Related Work] Related Work section: a brief comparison to recent audio-visual localization papers (e.g., those using active sound sources) would strengthen the novelty positioning.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve technical clarity, add requested controls, and enhance reproducibility.
-
Referee: [§3.2] §3.2 (Audio Feature Extraction): the description of DOA spectrum computation omits the specific algorithm (e.g., SRP-PHAT, MUSIC), frequency range, and handling of non-stationary or reverberant sources; because the central claim rests on these spectra providing reliable complementary spatial cues in uncontrolled environments, this omission is load-bearing and prevents assessment of whether reported gains arise from genuine audio geometry or dataset artifacts.
Authors: We agree that the current description in §3.2 lacks sufficient implementation details. In the revised manuscript we will explicitly state that DOA spectra are computed via the SRP-PHAT algorithm over the frequency range 100–8000 Hz, together with the preprocessing steps used to mitigate non-stationary sources and reverberation (short-time windowing and coherence-based masking). These additions will allow readers to confirm that the reported gains derive from genuine spatial geometry rather than dataset-specific artifacts.
Revision: yes
-
Referee: [§4.2] §4.2 and Table 2: the reported gains over vision-only baselines are presented without ablation studies that isolate the contribution of DOA spectra versus binaural embeddings, nor with statistical significance tests or cross-dataset variance; without these controls it is impossible to confirm that the improvements are attributable to the audio integration rather than implementation details or dataset biases.
Authors: We acknowledge the need for finer-grained controls. The revised version will include new ablation tables that separately disable DOA spectra and binaural embeddings, paired statistical significance tests (Wilcoxon signed-rank) on the pose-error differences, and per-dataset variance statistics. These additions will directly demonstrate that the observed improvements are attributable to the audio components rather than implementation or dataset biases.
Revision: yes
-
Referee: [§4.3] §4.3 (Robustness Experiments): the visual corruption protocols (motion blur, occlusion) are not fully specified with parameters such as kernel size or occlusion ratio, making it difficult to reproduce the robustness results or determine whether the audio features genuinely compensate for the exact degradation levels claimed.
Authors: We agree that the corruption parameters must be fully specified. In the revised §4.3 we will document the exact motion-blur kernel sizes (15×15 and 25×25 pixels), the occlusion ratios (20% and 40% random rectangular masks), and the frame-wise application procedure. These details will enable exact reproduction and allow readers to assess the degree to which audio compensates for the stated degradation levels.
Revision: yes
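The corruption parameters quoted in the rebuttal (15×15 and 25×25 blur kernels, 20% and 40% rectangular occlusion) could be realized in several ways. The sketch below shows one hypothetical reading on a grayscale frame stored as nested lists: horizontal linear motion blur (equivalent to a k×k kernel whose only non-zero entries lie on the centre row), edge clamping at the borders, and a single zero-filled rectangle whose area is roughly the stated ratio. None of these choices, nor the function names, are confirmed by the source.

```python
import math
import random

def motion_blur_horizontal(img, ksize=15):
    """Average each pixel with its horizontal neighbours (odd ksize, edge-clamped)."""
    h, w = len(img), len(img[0])
    half = ksize // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for dx in range(-half, half + 1):
                xx = min(max(x + dx, 0), w - 1)   # clamp to the image border
                acc += img[y][xx]
            row.append(acc / ksize)
        out.append(row)
    return out

def occlude_rectangle(img, ratio=0.2, rng=None, fill=0.0):
    """Blank one random rectangle covering about `ratio` of the frame area."""
    rng = rng or random.Random()
    h, w = len(img), len(img[0])
    scale = math.sqrt(ratio)                      # same aspect ratio as the frame
    rh, rw = max(1, round(h * scale)), max(1, round(w * scale))
    top = rng.randint(0, h - rh)
    left = rng.randint(0, w - rw)
    out = [row[:] for row in img]
    for y in range(top, top + rh):
        for x in range(left, left + rw):
            out[y][x] = fill
    return out

# Blur a vertical step edge: the sharp 0 -> 1 transition becomes a ramp.
step = [[1.0 if x >= 30 else 0.0 for x in range(60)] for _ in range(40)]
blurred = motion_blur_horizontal(step, ksize=15)

# Occlude an all-white frame and measure the fraction actually covered.
white = [[1.0] * 60 for _ in range(40)]
occluded = occlude_rectangle(white, ratio=0.2, rng=random.Random(0))
covered = sum(v == 0.0 for row in occluded for v in row) / (40 * 60)
print(blurred[0][30], covered)
```

Fixing the random seed, as done here, is one simple way to make the frame-wise application reproducible across runs.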
Circularity Check
No circularity in audio-visual pose estimation framework
The paper presents an additive framework that extracts DOA spectra and binaural embeddings from passive audio and integrates them into an existing vision-only pose model. Gains are shown empirically on two in-the-wild datasets against visual baselines, with no equations, parameters, or derivations that reduce by construction to fitted inputs or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked to force the result; the central claim rests on standard feature fusion and dataset evaluation rather than tautological renaming or ansatz smuggling.
Reference graph
Works this paper leans on
- [1] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 208–224. Springer, 2020. 2
- [2] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. 2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017. 2
- [3] Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In ECCV, 2022. 2
- [4] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. Arkitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In NeurIPS, 2021. 2
- [5] Ruojin Cai, Jason Y Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, and Ricardo Martin-Brualla. Can generative video models help pose estimation? In CVPR, 2025. 2
- [6] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017. 2
- [7] Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, and Kristen Grauman. Soundspaces 2.0: A simulation platform for visual-acoustic learning. In NeurIPS 2022 Datasets and Benchmarks Track, 2022. 1, 2, 5, 8
- [8] Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, and Andrea Vedaldi. Novel-view acoustic synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6409–6419, 2023. 4
- [9] Ziyang Chen, Shengyi Qian, and Andrew Owens. Sound localization from motion: Jointly learning sound direction and camera rotation. In International Conference on Computer Vision (ICCV), 2023. 1, 2, 4, 5, 6, 7, 12
- [10] Jesper Haahr Christensen, Sascha Hornauer, and Stella X. Yu. Batvision: Learning to see 3d spatial layout with two ears. 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 1581–1587, 2019. 1, 2
- [11] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 2
- [12] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020. 2
- [13] Yaqing Ding, Viktor Kocur, Vaclav Vavra, Zuzana Berger Haladova, Jian Yang, Torsten Sattler, and Zuzana Kukelova. Reposed: Efficient relative pose estimation with known depth information. arXiv preprint arXiv:2501.07742, 2025. 2
- [14] Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16739–16752, 2025. 1, 2, 3, 4, 5, 6, 7, 12, 13
- [15]
- [16] Ruohan Gao and Kristen Grauman. 2.5d visual sound. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 324–333, 2019. 4, 5, 12
- [17] Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, and Kristen Grauman. Visualechoes: Spatial image representation learning through echolocation. In ECCV, 2020. 1, 2, 4
- [18] Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, and Anurag Arnab. Audiovisual masked autoencoders. arXiv preprint arXiv:2212.05922, 2022. 2
- [19] Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R. Glass. Contrastive audio-visual masked autoencoder. In The Eleventh International Conference on Learning Representations, 2023. 2
- [20] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...
- [21] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. 5
- [22] Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, and Christoph Feichtenhofer. Mavil: Masked audio-video learners. arXiv preprint arXiv:2212.08071, 2022. 2
- [23] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. Advances in Neural Information Processing Systems, 31, 2018. 2
- [24] Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. In British Machine Vision Conference (BMVC), 2018. 2
- [25] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018. 2
- [26] Zhaojian Li, Bin Zhao, and Yuan Yuan. Cyclic learning for binaural audio generation and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26669–26678, 2024. 2
- [27] Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 2
- [28] Jiawei Liu, Zhijie Wang, Lei Ma, Chunrong Fang, Tongtong Bai, Xufan Zhang, Jia Liu, and Zhenyu Chen. Benchmarking object detection robustness against real-world corruptions. International Journal of Computer Vision, 132(10):4398–4416,
- [29] Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y. Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny. Uncommon objects in 3d. In arXiv, 2024. 2
- [30] Thibaut Loiseau, Guillaume Bourmaud, and Vincent Lepetit. Alligat0r: Pre-training through co-visibility segmentation for relative camera pose regression. CoRR, abs/2503.07561, 2025. 2
- [31] Sihan Ma, Jing Zhang, Qiong Cao, and Dacheng Tao. Posebench: Benchmarking the robustness of pose estimation models under corruptions. ArXiv, abs/2406.14367, 2024. 5
- [32] Sagnik Majumder, Ziad Al-Halah, and Kristen Grauman. Learning spatial features from audio-visual correspondence in egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27058–27068, 2024. 4
- [33] Anna Min, Ziyang Chen, Hang Zhao, and Andrew Owens. Supervising sound localization by in-the-wild egomotion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23936–23946, 2025. 2
- [34] Eric Mintun, Alexander Kirillov, and Saining Xie. On interaction between augmentations and corruptions in natural corruption robustness. In Advances in Neural Information Processing Systems, 2021. 5
- [35] Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Malinowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. The streetlearn environment and dataset. CoRR, abs/1903.01292,
- [36] Pedro Morgado, Yi Li, and Nuno Vasconcelos. Self-supervised generation of spatial audio for 360° video. Advances in Neural Information Processing Systems, 33:4733–4744, 2020. 2
- [37] P. Morgado, Y. Li, and N. Vasconcelos. Learning representations from audio-visual spatial alignment. In NeurIPS, 2020. 4
- [38] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and A. Ng. Multimodal deep learning. In International Conference on Machine Learning, 2011. 2
- [39] Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018. 2
- [40] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2405–2413,
- [41] Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, 2016. 2
- [42] Kranti Kumar Parida, Siddharth Srivastava, and Gaurav Sharma. Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention. In 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2151–2160, 2022. 4
- [43] Andres Perez-Lopez. pysofaconventions: Python implementation of the SOFA specification, 2019. 4
- [44] Xavi Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Ruslan Partsey, Jimmy Yang, Ruta Desai, Alexander William Clegg, Michal Hlavac, Tiffany Min, Theo Gervet, Vladimir Vondrus, Vincent-Pierre Berges, John Turner, Oleksandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara... Habitat 3.0: A co-habitat for humans, avatars and robots, 2023.
- [45] Senthil Purushwalkam, S. V. A. Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Kumar Gupta, and Kristen Grauman. Audio-visual floorplan reconstruction. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1163–1172, 2020. 1, 2
- [46] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In Thirty-fifth Conference on Neural Info...
- [47] Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, and David F. Fouhey. Far: Flexible, accurate and robust 6dof relative camera pose estimation. In CVPR,
- [48] Daniele Salvati, Carlo Drioli, and Gian Luca Foresti. Incoherent frequency fusion for broadband steered response power algorithms in noisy environments. IEEE Signal Processing Letters, 21(5):581–585, 2014. 4
- [49] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020. 2, 6
- [50] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 2
- [51] R. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276–280, 1986. 4
- [52] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. In SIGGRAPH Conference Proceedings, pages 835–846, New York, NY, USA, 2006. ACM Press. 2
- [53] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke ...
- [54] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. CVPR, 2021. 2, 6
- [55] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset...
- [56] Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assistants to rearrange their habitat...
- [57] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 2
- [58] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2
- [59] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024. 2
- [60] Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. In CVPR, 2024. 6
- [61] Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, and Daniel Cremers. Anycam: Learning to recover camera poses and intrinsics from casual videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16717–16727, 2025. 2
- [62] Xinyi Wu, Zhenyao Wu, Lili Ju, and Song Wang. Binaural audio-visual localization. In AAAI, pages 2961–2968, 2021. 2
- [63] Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 585–601, 2018. 2
- [64] Karren Yang, Bryan Russell, and Justin Salamon. Telling left from right: Learning spatial correspondence of sight and sound. In CVPR, 2020. 4
- [65] Karren Yang, Michael Firman, Eric Brachmann, and Clement Godard. Camera pose estimation and localization with active audio sensing. In European Conference on Computer Vision, pages 271–291. Springer, 2022. 1, 2
- [66] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020. 2
- [67] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the International Conference on Computer Vision (ICCV), 2023. 2
- [68] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2
- [69] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3r: A simple approach for estimating geometry in the presence of motion. In The Thirteenth International Conference on Learning Representations, 2025. 2
- [70] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018. 2
Table 5. Mean of the ground-truth rotation and translation of the camera, expressed using angles, for different activity scenarios. Scenario Mean Rotation (°) Mean Trans...
Supplementary material
7.1. Implementation Details
Here, we provide our model’s training hyperparameters:
- Training epochs: 100
- Warmup epochs: 5
- Learning rate: 10^-5
- Minimum learning rate: 10^-7
- Batch size: 64
- Learning rate scheduler: Cosine annealing
Here, we provide the steps and parameters for processing the visual inputs for Reloc3r [14], spe...
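The training hyperparameters listed in the implementation details can be read as a per-epoch learning-rate schedule. The following is a minimal sketch under stated assumptions: the source gives only the warmup length, the peak and minimum learning rates, the epoch count and the scheduler name, so the linear shape of the warmup, the per-epoch (rather than per-step) update and the function name are hypothetical.

```python
import math

def learning_rate(epoch, total_epochs=100, warmup_epochs=5,
                  base_lr=1e-5, min_lr=1e-7):
    """Learning rate for a zero-indexed epoch: warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Linear warmup is an assumption; the source only states "Warmup epochs: 5".
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    # Cosine annealing from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(100)]
for e in (0, 4, 5, 50, 99):
    print(e, schedule[e])
```

Under these assumptions the rate climbs to 10^-5 by the end of warmup and then decays smoothly towards 10^-7 over the remaining 95 epochs.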