pith. sign in

arxiv: 2512.12165 · v3 · submitted 2025-12-13 · 💻 cs.CV

Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Pith reviewed 2026-05-16 23:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords audio-visualcamera pose estimationpassive audioin-the-wild videodirection of arrivalbinaural embeddingsrelative pose3D scene understanding
0
0 comments X

The pith

Passive scene sounds complement vision for relative camera pose estimation in real-world videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Estimating how a camera moves through a scene is essential for 3D understanding but visual methods alone often fail when images are blurry or parts of the scene are hidden. This paper shows that sounds naturally occurring in the environment can supply extra spatial information to help solve this problem. The authors add simple audio features, specifically direction-of-arrival spectra and binaural embeddings, to a top visual pose model and test it on large collections of everyday videos. The combined system produces better pose estimates than vision alone, and the benefit remains even when the video quality drops. This approach marks the first successful use of incidental audio for this task outside controlled settings.

Core claim

Passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. A simple audio-visual framework integrates direction-of-arrival spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model, yielding consistent gains over strong visual baselines on two large datasets along with robustness to visual corruption.

What carries the argument

Direction-of-arrival spectra and binauralized embeddings extracted from passive scene audio, integrated into vision-only pose models.

Load-bearing premise

That direction-of-arrival spectra and binauralized embeddings from passive scene sounds can be integrated with visual features to produce consistent and measurable improvements in pose estimation accuracy.

What would settle it

Running the proposed integration on the same datasets and observing no accuracy gains or even losses compared to the vision-only baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.12165 by Daniel Adebi, Kristen Grauman, Sagnik Majumder.

Figure 1
Figure 1. Figure 1: Main idea: we propose to estimate relative camera pose from in-the-wild videos using both vision and multichannel audio. Unlike traditional active echolocation sensing, our model relies only on passive scene audio—gaining spatial cues opportunistically from naturally occurring ambient and foreground sound sources. Second, everyday sounds are invariant to some of the key obstacles for traditional vision-onl… view at source ↗
Figure 2
Figure 2. Figure 2: We extend the Reloc3r [14] architecture by incorporating both analytical and learned audio embeddings extracted using our Spatial Audio Encoder (SAE). Given a pair of source and target frames, coupled with their corresponding synchronized audio clips, our modified Reloc3r network predicts relative camera poses for the input image pair, in both directions (from source to target, and from target to source). … view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative examples of our audio-visual method outperforming a vision-only state-of-the-art relative camera pose estimation [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Our model performance compared against a SOTA vision [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that passive scene sounds supply complementary cues to vision for relative camera pose estimation on in-the-wild videos. It introduces a framework that augments a state-of-the-art vision-only model with direction-of-arrival (DOA) spectra and binauralized embeddings extracted from incidental audio, reporting consistent gains over visual baselines on two large datasets together with improved robustness when visual input is corrupted by blur or occlusion. The work positions itself as the first to successfully leverage everyday audio for this spatial task.

Significance. If the quantitative claims hold under scrutiny, the result would be significant for embodied perception and multi-modal 3D understanding: it demonstrates that readily available passive audio can mitigate well-known failure modes of vision-only pose estimators without requiring additional sensors or active sound sources. The approach could influence downstream applications in robotics and AR where visual degradation is common.

major comments (3)
  1. [§3.2] §3.2 (Audio Feature Extraction): the description of DOA spectrum computation omits the specific algorithm (e.g., SRP-PHAT, MUSIC), frequency range, and handling of non-stationary or reverberant sources; because the central claim rests on these spectra providing reliable complementary spatial cues in uncontrolled environments, this omission is load-bearing and prevents assessment of whether reported gains arise from genuine audio geometry or dataset artifacts.
  2. [§4.2] §4.2 and Table 2: the reported gains over vision-only baselines are presented without ablation studies that isolate the contribution of DOA spectra versus binaural embeddings, nor with statistical significance tests or cross-dataset variance; without these controls it is impossible to confirm that the improvements are attributable to the audio integration rather than implementation details or dataset biases.
  3. [§4.3] §4.3 (Robustness Experiments): the visual corruption protocols (motion blur, occlusion) are not fully specified with parameters such as kernel size or occlusion ratio, making it difficult to reproduce the robustness results or determine whether the audio features genuinely compensate for the exact degradation levels claimed.
minor comments (2)
  1. [Figure 2] Figure 2: the diagram of the audio-visual fusion module would benefit from explicit notation for how DOA spectra are concatenated or attended with visual features.
  2. [Related Work] Related Work section: a brief comparison to recent audio-visual localization papers (e.g., those using active sound sources) would strengthen the novelty positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve technical clarity, add requested controls, and enhance reproducibility.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Audio Feature Extraction): the description of DOA spectrum computation omits the specific algorithm (e.g., SRP-PHAT, MUSIC), frequency range, and handling of non-stationary or reverberant sources; because the central claim rests on these spectra providing reliable complementary spatial cues in uncontrolled environments, this omission is load-bearing and prevents assessment of whether reported gains arise from genuine audio geometry or dataset artifacts.

    Authors: We agree that the current description in §3.2 lacks sufficient implementation details. In the revised manuscript we will explicitly state that DOA spectra are computed via the SRP-PHAT algorithm over the frequency range 100–8000 Hz, together with the preprocessing steps used to mitigate non-stationary sources and reverberation (short-time windowing and coherence-based masking). These additions will allow readers to confirm that the reported gains derive from genuine spatial geometry rather than dataset-specific artifacts. revision: yes

  2. Referee: [§4.2] §4.2 and Table 2: the reported gains over vision-only baselines are presented without ablation studies that isolate the contribution of DOA spectra versus binaural embeddings, nor with statistical significance tests or cross-dataset variance; without these controls it is impossible to confirm that the improvements are attributable to the audio integration rather than implementation details or dataset biases.

    Authors: We acknowledge the need for finer-grained controls. The revised version will include new ablation tables that separately disable DOA spectra and binaural embeddings, paired statistical significance tests (Wilcoxon signed-rank) on the pose-error differences, and per-dataset variance statistics. These additions will directly demonstrate that the observed improvements are attributable to the audio components rather than implementation or dataset biases. revision: yes

  3. Referee: [§4.3] §4.3 (Robustness Experiments): the visual corruption protocols (motion blur, occlusion) are not fully specified with parameters such as kernel size or occlusion ratio, making it difficult to reproduce the robustness results or determine whether the audio features genuinely compensate for the exact degradation levels claimed.

    Authors: We agree that the corruption parameters must be fully specified. In the revised §4.3 we will document the exact motion-blur kernel sizes (15×15 and 25×25 pixels), the occlusion ratios (20 % and 40 % random rectangular masks), and the frame-wise application procedure. These details will enable exact reproduction and allow readers to assess the degree to which audio compensates for the stated degradation levels. revision: yes

Circularity Check

0 steps flagged

No circularity in audio-visual pose estimation framework

full rationale

The paper presents an additive framework that extracts DOA spectra and binaural embeddings from passive audio and integrates them into an existing vision-only pose model. Gains are shown empirically on two in-the-wild datasets against visual baselines, with no equations, parameters, or derivations that reduce by construction to fitted inputs or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked to force the result; the central claim rests on standard feature fusion and dataset evaluation rather than tautological renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5462 in / 1122 out tokens · 41878 ms · 2026-05-16T23:11:34.925547+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 4 internal anchors

  1. [1]

    Self-supervised learning of audio-visual objects from video

    Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 208–224. Springer, 2020. 2

  2. [2]

    Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017

    Relja Arandjelovic and Andrew Zisserman. Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017. 2

  3. [3]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. InECCV, 2022. 2

  4. [4]

    Arkitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. Arkitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. InNeurIPS, 2021. 2

  5. [5]

    Can generative video models help pose estimation? InCVPR, 2025

    Ruojin Cai, Jason Y Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, and Ricardo Martin-Brualla. Can generative video models help pose estimation? InCVPR, 2025. 2

  6. [6]

    Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal- ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017. 2

  7. [7]

    Soundspaces 2.0: A simula- tion platform for visual-acoustic learning

    Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robin- son, and Kristen Grauman. Soundspaces 2.0: A simula- tion platform for visual-acoustic learning. InNeurIPS 2022 Datasets and Benchmarks Track, 2022. 1, 2, 5, 8

  8. [8]

    Novel-view acoustic synthesis

    Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, and Andrea Vedaldi. Novel-view acoustic synthesis. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6409–6419, 2023. 4

  9. [9]

    Sound localization from motion: Jointly learning sound direction and camera rotation

    Ziyang Chen, Shengyi Qian, and Andrew Owens. Sound localization from motion: Jointly learning sound direction and camera rotation. InInternational Conference on Computer Vision (ICCV), 2023. 1, 2, 4, 5, 6, 7, 12

  10. [10]

    Jesper Haahr Christensen, Sascha Hornauer, and Stella X. Yu. Batvision: Learning to see 3d spatial layout with two ears.2020 IEEE International Conference on Robotics and Automation (ICRA), pages 1581–1587, 2019. 1, 2

  11. [11]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. InProc. Com- puter Vision and Pattern Recognition (CVPR), IEEE, 2017. 2

  12. [12]

    The epic-kitchens dataset: Collection, challenges and base- lines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and base- lines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020. 2

  13. [13]

    Fixing the scale and shift in monocular depth for camera pose estimation.arXiv preprint arXiv:2501.07742, 2025

    Yaqing Ding, Viktor Kocur, Vaclav Vavra, Zuzana Berger Haladova, Jian Yang, Torsten Sattler, and Zuzana Kukelova. Reposed: Efficient relative pose estimation with known depth information.arXiv preprint arXiv:2501.07742, 2025. 2

  14. [14]

    Reloc3r: Large- scale training of relative camera pose regression for generaliz- able, fast, and accurate visual localization

    Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large- scale training of relative camera pose regression for generaliz- able, fast, and accurate visual localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16739–16752, 2025. 1, 2, 3, 4, 5, 6, 7, 12, 13

  15. [15]

    Tenenbaum

    Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B. Tenenbaum. Look, listen, and act: Towards audio- visual embodied navigation. InICRA, 2020. 1, 2

  16. [16]

    2.5d visual sound

    Ruohan Gao and Kristen Grauman. 2.5d visual sound. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 324–333, 2019. 4, 5, 12

  17. [17]

    Visualechoes: Spatial image represen- tation learning through echolocation

    Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, and Kristen Grauman. Visualechoes: Spatial image represen- tation learning through echolocation. InECCV, 2020. 1, 2, 4

  18. [18]

    Audiovisual masked autoencoders.arXiv preprint arXiv:2212.05922, 2022

    Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tu- dor Ionescu, Mario Lucic, Cordelia Schmid, and Anurag Arnab. Audiovisual masked autoencoders.arXiv preprint arXiv:2212.05922, 2022. 2

  19. [19]

    Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R

    Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R. Glass. Contrastive audio-visual masked autoencoder. InThe Eleventh International Conference on Learning Representa- tions, 2023. 2

  20. [20]

    Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

  21. [21]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. 5

  22. [22]

    Mavil: Masked audio-video learners.arXiv preprint arXiv:2212.08071, 2022

    Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, and Christoph Feichtenhofer. Mavil: Masked audio-video learners.arXiv preprint arXiv:2212.08071, 2022. 2

  23. [23]

    Coopera- tive learning of audio and video models from self-supervised synchronization.Advances in Neural Information Processing Systems, 31, 2018

    Bruno Korbar, Du Tran, and Lorenzo Torresani. Coopera- tive learning of audio and video models from self-supervised synchronization.Advances in Neural Information Processing Systems, 31, 2018. 2

  24. [24]

    Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset

    Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. InBritish Machine Vision Conference (BMVC), 2018. 2

  25. [25]

    Megadepth: Learning single- view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single- view depth prediction from internet photos. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 2041–2050, 2018. 2 9

  26. [26]

    Cyclic learning for binaural audio generation and localization

    Zhaojian Li, Bin Zhao, and Yuan Yuan. Cyclic learning for binaural audio generation and localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26669–26678, 2024. 2

  27. [27]

    Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160– 22169, 2024. 2

  28. [28]

    Benchmarking object detection robustness against real-world corruptions.In- ternational Journal of Computer Vision, 132(10):4398–4416,

    Jiawei Liu, Zhijie Wang, Lei Ma, Chunrong Fang, Tongtong Bai, Xufan Zhang, Jia Liu, and Zhenyu Chen. Benchmarking object detection robustness against real-world corruptions.In- ternational Journal of Computer Vision, 132(10):4398–4416,

  29. [29]

    Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny

    Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y . Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny. Uncommon objects in 3d. InarXiv, 2024. 2

  30. [30]

    Alligat0r: Pre-training through co- visibility segmentation for relative camera pose regression.arXiv preprint arXiv:2503.07561,

    Thibaut Loiseau, Guillaume Bourmaud, and Vincent Lepetit. Alligat0r: Pre-training through co-visibility segmentation for relative camera pose regression.CoRR, abs/2503.07561, 2025. 2

  31. [31]

    Posebench: Benchmarking the robustness of pose estima- tion models under corruptions.ArXiv, abs/2406.14367, 2024

    Sihan Ma, Jing Zhang, Qiong Cao, and Dacheng Tao. Posebench: Benchmarking the robustness of pose estima- tion models under corruptions.ArXiv, abs/2406.14367, 2024. 5

  32. [32]

    Learning spatial features from audio-visual correspondence in egocentric videos

    Sagnik Majumder, Ziad Al-Halah, and Kristen Grauman. Learning spatial features from audio-visual correspondence in egocentric videos. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 27058–27068, 2024. 4

  33. [33]

    Supervising sound localization by in-the-wild egomotion

    Anna Min, Ziyang Chen, Hang Zhao, and Andrew Owens. Supervising sound localization by in-the-wild egomotion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23936–23946, 2025. 2

  34. [34]

    On in- teraction between augmentations and corruptions in natural corruption robustness

    Eric Mintun, Alexander Kirillov, and Saining Xie. On in- teraction between augmentations and corruptions in natural corruption robustness. InAdvances in Neural Information Processing Systems, 2021. 5

  35. [35]

    The StreetLearn Environment and Dataset

    Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Mali- nowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. The streetlearn environment and dataset.CoRR, abs/1903.01292,

  36. [36]

    Self- supervised generation of spatial audio for 360° video.Ad- vances in Neural Information Processing Systems, 33:4733– 4744, 2020

    Pedro Morgado, Yi Li, and Nuno Nvasconcelos. Self- supervised generation of spatial audio for 360° video.Ad- vances in Neural Information Processing Systems, 33:4733– 4744, 2020. 2

  37. [37]

    Morgado, Y

    P. Morgado, Y . Li, and N. Vasconcelos. Learning representa- tions from audio-visual spatial alignment. InNeurIPS, 2020. 4

  38. [38]

    Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and A. Ng. Multimodal deep learning. In International Conference on Machine Learning, 2011. 2

  39. [39]

    Audio-visual scene analysis with self-supervised multisensory features

    Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. InPro- ceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018. 2

  40. [40]

    Visually indicated sounds

    Andrew Owens, Phillip Isola, Josh McDermott, Antonio Tor- ralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2405–2413,

  41. [41]

    McDermott, William T

    Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. InEuropean Conference on Computer Vision, 2016. 2

  42. [42]

    Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention

    Kranti Kumar Parida, Siddharth Srivastava, and Gaurav Sharma. Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention. In2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2151–2160, 2022. 4

  43. [43]

    pysofaconventions: Python implemen- tation of the SOFA specification, 2019

    Andres Perez-Lopez. pysofaconventions: Python implemen- tation of the SOFA specification, 2019. 4

  44. [44]

    Habitat 3.0: A co-habitat for humans, avatars and robots, 2023

    Xavi Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Ruslan Partsey, Jimmy Yang, Ruta Desai, Alexan- der William Clegg, Michal Hlavac, Tiffany Min, Theo Gervet, Vladimir V ondrus, Vincent-Pierre Berges, John Turner, Olek- sandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Ak- shara...

  45. [45]

    Senthil Purushwalkam, S. V . A. Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Kumar Gupta, and Kristen Grauman. Audio-visual floorplan reconstruction. 2021 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 1163–1172, 2020. 1, 2

  46. [46]

    Habitat-matterport 3d dataset (HM3d): 1000 large- scale 3d environments for embodied AI

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wi- jmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew West- bury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large- scale 3d environments for embodied AI. InThirty-fifth Con- ference on Neural Info...

  47. [47]

    Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, and David F. Fouhey. Far: Flexible, accurate and robust 6dof relative camera pose estimation. InCVPR,

  48. [48]

    Incoher- ent frequency fusion for broadband steered response power algorithms in noisy environments.IEEE Signal Processing Letters, 21(5):581–585, 2014

    Daniele Salvati, Carlo Drioli, and Gian Luca Foresti. Incoher- ent frequency fusion for broadband steered response power algorithms in noisy environments.IEEE Signal Processing Letters, 21(5):581–585, 2014. 4

  49. [49]

    SuperGlue: Learning feature match- ing with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature match- ing with graph neural networks. InCVPR, 2020. 2, 6

  50. [50]

    Habitat: A Platform for Embodied AI Research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 2 10

  51. [51]

    R. Schmidt. Multiple emitter location and signal parameter estimation.IEEE Transactions on Antennas and Propagation, 34(3):276–280, 1986. 4

  52. [52]

    Seitz, and Richard Szeliski

    Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. InSIGGRAPH Conference Proceedings, pages 835–846, New York, NY , USA, 2006. ACM Press. 2

  53. [53]

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kim- berly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Ba- tra, Hauke ...

  54. [54]

    LoFTR: Detector-free local feature matching with transformers.CVPR, 2021

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers.CVPR, 2021. 2, 6

  55. [55]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et- tinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in percepti...

  56. [56]

    Habitat 2.0: Training home assistants to rearrange their habitat

    Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir V ondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assist...

  57. [57]

    Yfcc100m: The new data in multimedia research

    Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li- Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 2

  58. [58]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2

  59. [59]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024. 2

  60. [60]

    Efficient LoFTR: Semi-dense local feature matching with sparse-like speed

    Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. InCVPR, 2024. 6

  61. [61]

    Anycam: Learning to re- cover camera poses and intrinsics from casual videos

    Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, and Daniel Cremers. Anycam: Learning to re- cover camera poses and intrinsics from casual videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16717–16727, 2025. 2

  62. [62]

    Binaural audio-visual localization

    Xinyi Wu, Zhenyao Wu, Lili Ju, and Song Wang. Binaural audio-visual localization. InAAAI, pages 2961–2968, 2021. 2

  63. [63]

    Youtube-vos: Sequence-to-sequence video object segmentation

    Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. InProceedings of the European conference on computer vision (ECCV), pages 585–601, 2018. 2

  64. [64]

    Telling left from right: Learning spatial correspondence of sight and sound

    Karren Yang, Bryan Russell, and Justin Salamon. Telling left from right: Learning spatial correspondence of sight and sound. InCVPR, 2020. 4

  65. [65]

    Camera pose estimation and localization with active audio sensing

    Karren Yang, Michael Firman, Eric Brachmann, and Clement Godard. Camera pose estimation and localization with active audio sensing. InEuropean Conference on Computer Vision, pages 271–291. Springer, 2022. 1, 2

  66. [66]

    Blendedmvs: A large-scale dataset for generalized multi-view stereo networks

    Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020. 2

  67. [67]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the International Conference on Computer Vision (ICCV), 2023. 2

  68. [68]

    Geonet: Unsupervised learning of dense depth, optical flow and camera pose

    Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

  69. [69]

    MonST3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3r: A simple approach for estimating geometry in the presence of motion. InThe Thirteenth International Conference on Learning Representations, 2025. 2

  70. [70]

    Stereo Magnification: Learning View Synthesis using Multiplane Images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018. 2 11 Table 5. Mean of the ground-truth rotation and translation of the camera, expressed using angles, for different activity scenarios. Scenario Mean Rotation (◦) Mean Trans...

  71. [71]

    Supplementary material 7.1. Implementation Details Here, we provide our model’s training hyperparameters: • Training epochs: 100 • Warmup epochs: 5 • Learning rate:10 −5 • Minimum learning rate:10 −7 • Batch size: 64 • Learning rate scheduler: Cosine annealing Here, we provide the steps and parameters for processing the visual inputs for Reloc3r [14], spe...