Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi; Kristen Grauman; Sagnik Majumder

arxiv: 2512.12165 · v3 · submitted 2025-12-13 · 💻 cs.CV

Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi , Sagnik Majumder , Kristen Grauman This is my paper

Pith reviewed 2026-05-16 23:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords audio-visualcamera pose estimationpassive audioin-the-wild videodirection of arrivalbinaural embeddingsrelative pose3D scene understanding

0 comments

The pith

Passive scene sounds complement vision for relative camera pose estimation in real-world videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Estimating how a camera moves through a scene is essential for 3D understanding but visual methods alone often fail when images are blurry or parts of the scene are hidden. This paper shows that sounds naturally occurring in the environment can supply extra spatial information to help solve this problem. The authors add simple audio features, specifically direction-of-arrival spectra and binaural embeddings, to a top visual pose model and test it on large collections of everyday videos. The combined system produces better pose estimates than vision alone, and the benefit remains even when the video quality drops. This approach marks the first successful use of incidental audio for this task outside controlled settings.

Core claim

Passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. A simple audio-visual framework integrates direction-of-arrival spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model, yielding consistent gains over strong visual baselines on two large datasets along with robustness to visual corruption.

What carries the argument

Direction-of-arrival spectra and binauralized embeddings extracted from passive scene audio, integrated into vision-only pose models.

Load-bearing premise

That direction-of-arrival spectra and binauralized embeddings from passive scene sounds can be integrated with visual features to produce consistent and measurable improvements in pose estimation accuracy.

What would settle it

Running the proposed integration on the same datasets and observing no accuracy gains or even losses compared to the vision-only baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.12165 by Daniel Adebi, Kristen Grauman, Sagnik Majumder.

**Figure 1.** Figure 1: Main idea: we propose to estimate relative camera pose from in-the-wild videos using both vision and multichannel audio. Unlike traditional active echolocation sensing, our model relies only on passive scene audio—gaining spatial cues opportunistically from naturally occurring ambient and foreground sound sources. Second, everyday sounds are invariant to some of the key obstacles for traditional vision-onl… view at source ↗

**Figure 2.** Figure 2: We extend the Reloc3r [14] architecture by incorporating both analytical and learned audio embeddings extracted using our Spatial Audio Encoder (SAE). Given a pair of source and target frames, coupled with their corresponding synchronized audio clips, our modified Reloc3r network predicts relative camera poses for the input image pair, in both directions (from source to target, and from target to source). … view at source ↗

**Figure 3.** Figure 3: Qualitative examples of our audio-visual method outperforming a vision-only state-of-the-art relative camera pose estimation [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Our model performance compared against a SOTA vision [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

read the original abstract

Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims to be the first to fuse passive scene audio with vision for relative camera pose in real-world videos and reports gains plus robustness, but the abstract gives no numbers or method details to check if the audio actually helps.

read the letter

The key takeaway is that this work claims to be the first to use passive scene audio for relative camera pose estimation in real-world videos, integrating DOA spectra and binaural embeddings into a vision model and showing gains plus robustness to visual corruption. They do a good job framing the problem and identifying audio as a potential complementary signal when vision fails due to blur or occlusions. The approach sounds simple and practical, building on existing vision methods rather than starting from scratch. If the experiments hold up, this could open doors for multi-modal systems in robotics and AR. That said, the abstract provides no quantitative results, no dataset names or sizes, no baseline comparisons, and no error breakdowns. This makes it difficult to judge whether the consistent gains are substantial or marginal, and whether they come from genuine audio cues or something else. The concern about DOA estimation in uncontrolled environments is worth pressing: in-the-wild videos often have overlapping sounds, moving sources, and heavy noise, so it's not obvious that reliable direction-of-arrival spectra can be extracted without special conditions. The paper will need to detail the audio processing pipeline and show ablations to address this. Overall, this is aimed at researchers in computer vision and multi-modal learning who are looking for ways to make pose estimation more robust. Readers focused on embodied perception might find the idea useful to build on, even if they end up modifying the audio feature extraction. I would send it for peer review because the core idea is fresh and the problem matters, though the authors should expect questions on the experimental validation and the reliability of the audio signals in practice.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that passive scene sounds supply complementary cues to vision for relative camera pose estimation on in-the-wild videos. It introduces a framework that augments a state-of-the-art vision-only model with direction-of-arrival (DOA) spectra and binauralized embeddings extracted from incidental audio, reporting consistent gains over visual baselines on two large datasets together with improved robustness when visual input is corrupted by blur or occlusion. The work positions itself as the first to successfully leverage everyday audio for this spatial task.

Significance. If the quantitative claims hold under scrutiny, the result would be significant for embodied perception and multi-modal 3D understanding: it demonstrates that readily available passive audio can mitigate well-known failure modes of vision-only pose estimators without requiring additional sensors or active sound sources. The approach could influence downstream applications in robotics and AR where visual degradation is common.

major comments (3)

[§3.2] §3.2 (Audio Feature Extraction): the description of DOA spectrum computation omits the specific algorithm (e.g., SRP-PHAT, MUSIC), frequency range, and handling of non-stationary or reverberant sources; because the central claim rests on these spectra providing reliable complementary spatial cues in uncontrolled environments, this omission is load-bearing and prevents assessment of whether reported gains arise from genuine audio geometry or dataset artifacts.
[§4.2] §4.2 and Table 2: the reported gains over vision-only baselines are presented without ablation studies that isolate the contribution of DOA spectra versus binaural embeddings, nor with statistical significance tests or cross-dataset variance; without these controls it is impossible to confirm that the improvements are attributable to the audio integration rather than implementation details or dataset biases.
[§4.3] §4.3 (Robustness Experiments): the visual corruption protocols (motion blur, occlusion) are not fully specified with parameters such as kernel size or occlusion ratio, making it difficult to reproduce the robustness results or determine whether the audio features genuinely compensate for the exact degradation levels claimed.

minor comments (2)

[Figure 2] Figure 2: the diagram of the audio-visual fusion module would benefit from explicit notation for how DOA spectra are concatenated or attended with visual features.
[Related Work] Related Work section: a brief comparison to recent audio-visual localization papers (e.g., those using active sound sources) would strengthen the novelty positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to improve technical clarity, add requested controls, and enhance reproducibility.

read point-by-point responses

Referee: [§3.2] §3.2 (Audio Feature Extraction): the description of DOA spectrum computation omits the specific algorithm (e.g., SRP-PHAT, MUSIC), frequency range, and handling of non-stationary or reverberant sources; because the central claim rests on these spectra providing reliable complementary spatial cues in uncontrolled environments, this omission is load-bearing and prevents assessment of whether reported gains arise from genuine audio geometry or dataset artifacts.

Authors: We agree that the current description in §3.2 lacks sufficient implementation details. In the revised manuscript we will explicitly state that DOA spectra are computed via the SRP-PHAT algorithm over the frequency range 100–8000 Hz, together with the preprocessing steps used to mitigate non-stationary sources and reverberation (short-time windowing and coherence-based masking). These additions will allow readers to confirm that the reported gains derive from genuine spatial geometry rather than dataset-specific artifacts. revision: yes
Referee: [§4.2] §4.2 and Table 2: the reported gains over vision-only baselines are presented without ablation studies that isolate the contribution of DOA spectra versus binaural embeddings, nor with statistical significance tests or cross-dataset variance; without these controls it is impossible to confirm that the improvements are attributable to the audio integration rather than implementation details or dataset biases.

Authors: We acknowledge the need for finer-grained controls. The revised version will include new ablation tables that separately disable DOA spectra and binaural embeddings, paired statistical significance tests (Wilcoxon signed-rank) on the pose-error differences, and per-dataset variance statistics. These additions will directly demonstrate that the observed improvements are attributable to the audio components rather than implementation or dataset biases. revision: yes
Referee: [§4.3] §4.3 (Robustness Experiments): the visual corruption protocols (motion blur, occlusion) are not fully specified with parameters such as kernel size or occlusion ratio, making it difficult to reproduce the robustness results or determine whether the audio features genuinely compensate for the exact degradation levels claimed.

Authors: We agree that the corruption parameters must be fully specified. In the revised §4.3 we will document the exact motion-blur kernel sizes (15×15 and 25×25 pixels), the occlusion ratios (20 % and 40 % random rectangular masks), and the frame-wise application procedure. These details will enable exact reproduction and allow readers to assess the degree to which audio compensates for the stated degradation levels. revision: yes

Circularity Check

0 steps flagged

No circularity in audio-visual pose estimation framework

full rationale

The paper presents an additive framework that extracts DOA spectra and binaural embeddings from passive audio and integrates them into an existing vision-only pose model. Gains are shown empirically on two in-the-wild datasets against visual baselines, with no equations, parameters, or derivations that reduce by construction to fitted inputs or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked to force the result; the central claim rests on standard feature fusion and dataset evaluation rather than tautological renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5462 in / 1122 out tokens · 41878 ms · 2026-05-16T23:11:34.925547+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 4 internal anchors

[1]

Self-supervised learning of audio-visual objects from video

Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 208–224. Springer, 2020. 2

work page 2020
[2]

Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017

Relja Arandjelovic and Andrew Zisserman. Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017. 2

work page 2017
[3]

Map-free visual relocalization: Metric pose relative to a single image

Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. InECCV, 2022. 2

work page 2022
[4]

Arkitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. Arkitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. InNeurIPS, 2021. 2

work page 2021
[5]

Can generative video models help pose estimation? InCVPR, 2025

Ruojin Cai, Jason Y Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, and Ricardo Martin-Brualla. Can generative video models help pose estimation? InCVPR, 2025. 2

work page 2025
[6]

Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal- ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017. 2

work page 2017
[7]

Soundspaces 2.0: A simula- tion platform for visual-acoustic learning

Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robin- son, and Kristen Grauman. Soundspaces 2.0: A simula- tion platform for visual-acoustic learning. InNeurIPS 2022 Datasets and Benchmarks Track, 2022. 1, 2, 5, 8

work page 2022
[8]

Novel-view acoustic synthesis

Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, and Andrea Vedaldi. Novel-view acoustic synthesis. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6409–6419, 2023. 4

work page 2023
[9]

Sound localization from motion: Jointly learning sound direction and camera rotation

Ziyang Chen, Shengyi Qian, and Andrew Owens. Sound localization from motion: Jointly learning sound direction and camera rotation. InInternational Conference on Computer Vision (ICCV), 2023. 1, 2, 4, 5, 6, 7, 12

work page 2023
[10]

Jesper Haahr Christensen, Sascha Hornauer, and Stella X. Yu. Batvision: Learning to see 3d spatial layout with two ears.2020 IEEE International Conference on Robotics and Automation (ICRA), pages 1581–1587, 2019. 1, 2

work page 2020
[11]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. InProc. Com- puter Vision and Pattern Recognition (CVPR), IEEE, 2017. 2

work page 2017
[12]

The epic-kitchens dataset: Collection, challenges and base- lines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and base- lines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020. 2

work page 2020
[13]

Fixing the scale and shift in monocular depth for camera pose estimation.arXiv preprint arXiv:2501.07742, 2025

Yaqing Ding, Viktor Kocur, Vaclav Vavra, Zuzana Berger Haladova, Jian Yang, Torsten Sattler, and Zuzana Kukelova. Reposed: Efficient relative pose estimation with known depth information.arXiv preprint arXiv:2501.07742, 2025. 2

work page arXiv 2025
[14]

Reloc3r: Large- scale training of relative camera pose regression for generaliz- able, fast, and accurate visual localization

Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large- scale training of relative camera pose regression for generaliz- able, fast, and accurate visual localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16739–16752, 2025. 1, 2, 3, 4, 5, 6, 7, 12, 13

work page 2025
[15]

Tenenbaum

Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B. Tenenbaum. Look, listen, and act: Towards audio- visual embodied navigation. InICRA, 2020. 1, 2

work page 2020
[16]

2.5d visual sound

Ruohan Gao and Kristen Grauman. 2.5d visual sound. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 324–333, 2019. 4, 5, 12

work page 2019
[17]

Visualechoes: Spatial image represen- tation learning through echolocation

Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, and Kristen Grauman. Visualechoes: Spatial image represen- tation learning through echolocation. InECCV, 2020. 1, 2, 4

work page 2020
[18]

Audiovisual masked autoencoders.arXiv preprint arXiv:2212.05922, 2022

Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tu- dor Ionescu, Mario Lucic, Cordelia Schmid, and Anurag Arnab. Audiovisual masked autoencoders.arXiv preprint arXiv:2212.05922, 2022. 2

work page arXiv 2022
[19]

Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R

Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R. Glass. Contrastive audio-visual masked autoencoder. InThe Eleventh International Conference on Learning Representa- tions, 2023. 2

work page 2023
[20]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

work page 2024
[21]

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. 5

work page internal anchor Pith review Pith/arXiv arXiv 1903
[22]

Mavil: Masked audio-video learners.arXiv preprint arXiv:2212.08071, 2022

Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, and Christoph Feichtenhofer. Mavil: Masked audio-video learners.arXiv preprint arXiv:2212.08071, 2022. 2

work page arXiv 2022
[23]

Coopera- tive learning of audio and video models from self-supervised synchronization.Advances in Neural Information Processing Systems, 31, 2018

Bruno Korbar, Du Tran, and Lorenzo Torresani. Coopera- tive learning of audio and video models from self-supervised synchronization.Advances in Neural Information Processing Systems, 31, 2018. 2

work page 2018
[24]

Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset

Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. InBritish Machine Vision Conference (BMVC), 2018. 2

work page 2018
[25]

Megadepth: Learning single- view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single- view depth prediction from internet photos. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 2041–2050, 2018. 2 9

work page 2041
[26]

Cyclic learning for binaural audio generation and localization

Zhaojian Li, Bin Zhao, and Yuan Yuan. Cyclic learning for binaural audio generation and localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26669–26678, 2024. 2

work page 2024
[27]

Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160– 22169, 2024. 2

work page 2024
[28]

Benchmarking object detection robustness against real-world corruptions.In- ternational Journal of Computer Vision, 132(10):4398–4416,

Jiawei Liu, Zhijie Wang, Lei Ma, Chunrong Fang, Tongtong Bai, Xufan Zhang, Jia Liu, and Zhenyu Chen. Benchmarking object detection robustness against real-world corruptions.In- ternational Journal of Computer Vision, 132(10):4398–4416,

work page
[29]

Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny

Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y . Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny. Uncommon objects in 3d. InarXiv, 2024. 2

work page 2024
[30]

Alligat0r: Pre-training through co- visibility segmentation for relative camera pose regression.arXiv preprint arXiv:2503.07561,

Thibaut Loiseau, Guillaume Bourmaud, and Vincent Lepetit. Alligat0r: Pre-training through co-visibility segmentation for relative camera pose regression.CoRR, abs/2503.07561, 2025. 2

work page arXiv 2025
[31]

Posebench: Benchmarking the robustness of pose estima- tion models under corruptions.ArXiv, abs/2406.14367, 2024

Sihan Ma, Jing Zhang, Qiong Cao, and Dacheng Tao. Posebench: Benchmarking the robustness of pose estima- tion models under corruptions.ArXiv, abs/2406.14367, 2024. 5

work page arXiv 2024
[32]

Learning spatial features from audio-visual correspondence in egocentric videos

Sagnik Majumder, Ziad Al-Halah, and Kristen Grauman. Learning spatial features from audio-visual correspondence in egocentric videos. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 27058–27068, 2024. 4

work page 2024
[33]

Supervising sound localization by in-the-wild egomotion

Anna Min, Ziyang Chen, Hang Zhao, and Andrew Owens. Supervising sound localization by in-the-wild egomotion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23936–23946, 2025. 2

work page 2025
[34]

On in- teraction between augmentations and corruptions in natural corruption robustness

Eric Mintun, Alexander Kirillov, and Saining Xie. On in- teraction between augmentations and corruptions in natural corruption robustness. InAdvances in Neural Information Processing Systems, 2021. 5

work page 2021
[35]

The StreetLearn Environment and Dataset

Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Mali- nowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. The streetlearn environment and dataset.CoRR, abs/1903.01292,

work page internal anchor Pith review Pith/arXiv arXiv 1903
[36]

Self- supervised generation of spatial audio for 360° video.Ad- vances in Neural Information Processing Systems, 33:4733– 4744, 2020

Pedro Morgado, Yi Li, and Nuno Nvasconcelos. Self- supervised generation of spatial audio for 360° video.Ad- vances in Neural Information Processing Systems, 33:4733– 4744, 2020. 2

work page 2020
[37]

Morgado, Y

P. Morgado, Y . Li, and N. Vasconcelos. Learning representa- tions from audio-visual spatial alignment. InNeurIPS, 2020. 4

work page 2020
[38]

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and A. Ng. Multimodal deep learning. In International Conference on Machine Learning, 2011. 2

work page 2011
[39]

Audio-visual scene analysis with self-supervised multisensory features

Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. InPro- ceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018. 2

work page 2018
[40]

Visually indicated sounds

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Tor- ralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2405–2413,

work page
[41]

McDermott, William T

Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. InEuropean Conference on Computer Vision, 2016. 2

work page 2016
[42]

Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention

Kranti Kumar Parida, Siddharth Srivastava, and Gaurav Sharma. Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention. In2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2151–2160, 2022. 4

work page 2022
[43]

pysofaconventions: Python implemen- tation of the SOFA specification, 2019

Andres Perez-Lopez. pysofaconventions: Python implemen- tation of the SOFA specification, 2019. 4

work page 2019
[44]

Habitat 3.0: A co-habitat for humans, avatars and robots, 2023

Xavi Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Ruslan Partsey, Jimmy Yang, Ruta Desai, Alexan- der William Clegg, Michal Hlavac, Tiffany Min, Theo Gervet, Vladimir V ondrus, Vincent-Pierre Berges, John Turner, Olek- sandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Ak- shara...

work page 2023
[45]

Senthil Purushwalkam, S. V . A. Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Kumar Gupta, and Kristen Grauman. Audio-visual floorplan reconstruction. 2021 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 1163–1172, 2020. 1, 2

work page 2021
[46]

Habitat-matterport 3d dataset (HM3d): 1000 large- scale 3d environments for embodied AI

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wi- jmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew West- bury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large- scale 3d environments for embodied AI. InThirty-fifth Con- ference on Neural Info...

work page 2021
[47]

Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, and David F. Fouhey. Far: Flexible, accurate and robust 6dof relative camera pose estimation. InCVPR,

work page
[48]

Incoher- ent frequency fusion for broadband steered response power algorithms in noisy environments.IEEE Signal Processing Letters, 21(5):581–585, 2014

Daniele Salvati, Carlo Drioli, and Gian Luca Foresti. Incoher- ent frequency fusion for broadband steered response power algorithms in noisy environments.IEEE Signal Processing Letters, 21(5):581–585, 2014. 4

work page 2014
[49]

SuperGlue: Learning feature match- ing with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature match- ing with graph neural networks. InCVPR, 2020. 2, 6

work page 2020
[50]

Habitat: A Platform for Embodied AI Research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 2 10

work page 2019
[51]

R. Schmidt. Multiple emitter location and signal parameter estimation.IEEE Transactions on Antennas and Propagation, 34(3):276–280, 1986. 4

work page 1986
[52]

Seitz, and Richard Szeliski

Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. InSIGGRAPH Conference Proceedings, pages 835–846, New York, NY , USA, 2006. ACM Press. 2

work page 2006
[53]

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kim- berly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Ba- tra, Hauke ...

work page internal anchor Pith review Pith/arXiv arXiv 1906
[54]

LoFTR: Detector-free local feature matching with transformers.CVPR, 2021

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers.CVPR, 2021. 2, 6

work page 2021
[55]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et- tinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in percepti...

work page 2020
[56]

Habitat 2.0: Training home assistants to rearrange their habitat

Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir V ondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assist...

work page 2021
[57]

Yfcc100m: The new data in multimedia research

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li- Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 2

work page 2016
[58]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2

work page 2025
[59]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024. 2

work page 2024
[60]

Efficient LoFTR: Semi-dense local feature matching with sparse-like speed

Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. InCVPR, 2024. 6

work page 2024
[61]

Anycam: Learning to re- cover camera poses and intrinsics from casual videos

Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, and Daniel Cremers. Anycam: Learning to re- cover camera poses and intrinsics from casual videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16717–16727, 2025. 2

work page 2025
[62]

Binaural audio-visual localization

Xinyi Wu, Zhenyao Wu, Lili Ju, and Song Wang. Binaural audio-visual localization. InAAAI, pages 2961–2968, 2021. 2

work page 2021
[63]

Youtube-vos: Sequence-to-sequence video object segmentation

Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. InProceedings of the European conference on computer vision (ECCV), pages 585–601, 2018. 2

work page 2018
[64]

Telling left from right: Learning spatial correspondence of sight and sound

Karren Yang, Bryan Russell, and Justin Salamon. Telling left from right: Learning spatial correspondence of sight and sound. InCVPR, 2020. 4

work page 2020
[65]

Camera pose estimation and localization with active audio sensing

Karren Yang, Michael Firman, Eric Brachmann, and Clement Godard. Camera pose estimation and localization with active audio sensing. InEuropean Conference on Computer Vision, pages 271–291. Springer, 2022. 1, 2

work page 2022
[66]

Blendedmvs: A large-scale dataset for generalized multi-view stereo networks

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020. 2

work page 2020
[67]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the International Conference on Computer Vision (ICCV), 2023. 2

work page 2023
[68]

Geonet: Unsupervised learning of dense depth, optical flow and camera pose

Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

work page 2018
[69]

MonST3r: A simple approach for estimating geometry in the presence of motion

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3r: A simple approach for estimating geometry in the presence of motion. InThe Thirteenth International Conference on Learning Representations, 2025. 2

work page 2025
[70]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018. 2 11 Table 5. Mean of the ground-truth rotation and translation of the camera, expressed using angles, for different activity scenarios. Scenario Mean Rotation (◦) Mean Trans...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[71]

Supplementary material 7.1. Implementation Details Here, we provide our model’s training hyperparameters: • Training epochs: 100 • Warmup epochs: 5 • Learning rate:10 −5 • Minimum learning rate:10 −7 • Batch size: 64 • Learning rate scheduler: Cosine annealing Here, we provide the steps and parameters for processing the visual inputs for Reloc3r [14], spe...

work page

[1] [1]

Self-supervised learning of audio-visual objects from video

Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 208–224. Springer, 2020. 2

work page 2020

[2] [2]

Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017

Relja Arandjelovic and Andrew Zisserman. Look, listen and learn.2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617, 2017. 2

work page 2017

[3] [3]

Map-free visual relocalization: Metric pose relative to a single image

Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. InECCV, 2022. 2

work page 2022

[4] [4]

Arkitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. Arkitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. InNeurIPS, 2021. 2

work page 2021

[5] [5]

Can generative video models help pose estimation? InCVPR, 2025

Ruojin Cai, Jason Y Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, and Ricardo Martin-Brualla. Can generative video models help pose estimation? InCVPR, 2025. 2

work page 2025

[6] [6]

Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal- ber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.International Conference on 3D Vision (3DV), 2017. 2

work page 2017

[7] [7]

Soundspaces 2.0: A simula- tion platform for visual-acoustic learning

Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robin- son, and Kristen Grauman. Soundspaces 2.0: A simula- tion platform for visual-acoustic learning. InNeurIPS 2022 Datasets and Benchmarks Track, 2022. 1, 2, 5, 8

work page 2022

[8] [8]

Novel-view acoustic synthesis

Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, and Andrea Vedaldi. Novel-view acoustic synthesis. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6409–6419, 2023. 4

work page 2023

[9] [9]

Sound localization from motion: Jointly learning sound direction and camera rotation

Ziyang Chen, Shengyi Qian, and Andrew Owens. Sound localization from motion: Jointly learning sound direction and camera rotation. InInternational Conference on Computer Vision (ICCV), 2023. 1, 2, 4, 5, 6, 7, 12

work page 2023

[10] [10]

Jesper Haahr Christensen, Sascha Hornauer, and Stella X. Yu. Batvision: Learning to see 3d spatial layout with two ears.2020 IEEE International Conference on Robotics and Automation (ICRA), pages 1581–1587, 2019. 1, 2

work page 2020

[11] [11]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. InProc. Com- puter Vision and Pattern Recognition (CVPR), IEEE, 2017. 2

work page 2017

[12] [12]

The epic-kitchens dataset: Collection, challenges and base- lines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and base- lines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020. 2

work page 2020

[13] [13]

Fixing the scale and shift in monocular depth for camera pose estimation.arXiv preprint arXiv:2501.07742, 2025

Yaqing Ding, Viktor Kocur, Vaclav Vavra, Zuzana Berger Haladova, Jian Yang, Torsten Sattler, and Zuzana Kukelova. Reposed: Efficient relative pose estimation with known depth information.arXiv preprint arXiv:2501.07742, 2025. 2

work page arXiv 2025

[14] [14]

Reloc3r: Large- scale training of relative camera pose regression for generaliz- able, fast, and accurate visual localization

Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large- scale training of relative camera pose regression for generaliz- able, fast, and accurate visual localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16739–16752, 2025. 1, 2, 3, 4, 5, 6, 7, 12, 13

work page 2025

[15] [15]

Tenenbaum

Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B. Tenenbaum. Look, listen, and act: Towards audio- visual embodied navigation. InICRA, 2020. 1, 2

work page 2020

[16] [16]

2.5d visual sound

Ruohan Gao and Kristen Grauman. 2.5d visual sound. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 324–333, 2019. 4, 5, 12

work page 2019

[17] [17]

Visualechoes: Spatial image represen- tation learning through echolocation

Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, and Kristen Grauman. Visualechoes: Spatial image represen- tation learning through echolocation. InECCV, 2020. 1, 2, 4

work page 2020

[18] [18]

Audiovisual masked autoencoders.arXiv preprint arXiv:2212.05922, 2022

Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tu- dor Ionescu, Mario Lucic, Cordelia Schmid, and Anurag Arnab. Audiovisual masked autoencoders.arXiv preprint arXiv:2212.05922, 2022. 2

work page arXiv 2022

[19] [19]

Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R

Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R. Glass. Contrastive audio-visual masked autoencoder. InThe Eleventh International Conference on Learning Representa- tions, 2023. 2

work page 2023

[20] [20]

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 193...

work page 2024

[21] [21]

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. 5

work page internal anchor Pith review Pith/arXiv arXiv 1903

[22] [22]

Mavil: Masked audio-video learners.arXiv preprint arXiv:2212.08071, 2022

Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, and Christoph Feichtenhofer. Mavil: Masked audio-video learners.arXiv preprint arXiv:2212.08071, 2022. 2

work page arXiv 2022

[23] [23]

Coopera- tive learning of audio and video models from self-supervised synchronization.Advances in Neural Information Processing Systems, 31, 2018

Bruno Korbar, Du Tran, and Lorenzo Torresani. Coopera- tive learning of audio and video models from self-supervised synchronization.Advances in Neural Information Processing Systems, 31, 2018. 2

work page 2018

[24] [24]

Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset

Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. InBritish Machine Vision Conference (BMVC), 2018. 2

work page 2018

[25] [25]

Megadepth: Learning single- view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single- view depth prediction from internet photos. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 2041–2050, 2018. 2 9

work page 2041

[26] [26]

Cyclic learning for binaural audio generation and localization

Zhaojian Li, Bin Zhao, and Yuan Yuan. Cyclic learning for binaural audio generation and localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26669–26678, 2024. 2

work page 2024

[27] [27]

Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning- based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160– 22169, 2024. 2

work page 2024

[28] [28]

Benchmarking object detection robustness against real-world corruptions.In- ternational Journal of Computer Vision, 132(10):4398–4416,

Jiawei Liu, Zhijie Wang, Lei Ma, Chunrong Fang, Tongtong Bai, Xufan Zhang, Jia Liu, and Zhenyu Chen. Benchmarking object detection robustness against real-world corruptions.In- ternational Journal of Computer Vision, 132(10):4398–4416,

work page

[29] [29]

Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny

Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y . Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny. Uncommon objects in 3d. InarXiv, 2024. 2

work page 2024

[30] [30]

Alligat0r: Pre-training through co- visibility segmentation for relative camera pose regression.arXiv preprint arXiv:2503.07561,

Thibaut Loiseau, Guillaume Bourmaud, and Vincent Lepetit. Alligat0r: Pre-training through co-visibility segmentation for relative camera pose regression.CoRR, abs/2503.07561, 2025. 2

work page arXiv 2025

[31] [31]

Posebench: Benchmarking the robustness of pose estima- tion models under corruptions.ArXiv, abs/2406.14367, 2024

Sihan Ma, Jing Zhang, Qiong Cao, and Dacheng Tao. Posebench: Benchmarking the robustness of pose estima- tion models under corruptions.ArXiv, abs/2406.14367, 2024. 5

work page arXiv 2024

[32] [32]

Learning spatial features from audio-visual correspondence in egocentric videos

Sagnik Majumder, Ziad Al-Halah, and Kristen Grauman. Learning spatial features from audio-visual correspondence in egocentric videos. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 27058–27068, 2024. 4

work page 2024

[33] [33]

Supervising sound localization by in-the-wild egomotion

Anna Min, Ziyang Chen, Hang Zhao, and Andrew Owens. Supervising sound localization by in-the-wild egomotion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23936–23946, 2025. 2

work page 2025

[34] [34]

On in- teraction between augmentations and corruptions in natural corruption robustness

Eric Mintun, Alexander Kirillov, and Saining Xie. On in- teraction between augmentations and corruptions in natural corruption robustness. InAdvances in Neural Information Processing Systems, 2021. 5

work page 2021

[35] [35]

The StreetLearn Environment and Dataset

Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Mali- nowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. The streetlearn environment and dataset.CoRR, abs/1903.01292,

work page internal anchor Pith review Pith/arXiv arXiv 1903

[36] [36]

Self- supervised generation of spatial audio for 360° video.Ad- vances in Neural Information Processing Systems, 33:4733– 4744, 2020

Pedro Morgado, Yi Li, and Nuno Nvasconcelos. Self- supervised generation of spatial audio for 360° video.Ad- vances in Neural Information Processing Systems, 33:4733– 4744, 2020. 2

work page 2020

[37] [37]

Morgado, Y

P. Morgado, Y . Li, and N. Vasconcelos. Learning representa- tions from audio-visual spatial alignment. InNeurIPS, 2020. 4

work page 2020

[38] [38]

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and A. Ng. Multimodal deep learning. In International Conference on Machine Learning, 2011. 2

work page 2011

[39] [39]

Audio-visual scene analysis with self-supervised multisensory features

Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. InPro- ceedings of the European Conference on Computer Vision (ECCV), pages 631–648, 2018. 2

work page 2018

[40] [40]

Visually indicated sounds

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Tor- ralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2405–2413,

work page

[41] [41]

McDermott, William T

Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. InEuropean Conference on Computer Vision, 2016. 2

work page 2016

[42] [42]

Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention

Kranti Kumar Parida, Siddharth Srivastava, and Gaurav Sharma. Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention. In2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2151–2160, 2022. 4

work page 2022

[43] [43]

pysofaconventions: Python implemen- tation of the SOFA specification, 2019

Andres Perez-Lopez. pysofaconventions: Python implemen- tation of the SOFA specification, 2019. 4

work page 2019

[44] [44]

Habitat 3.0: A co-habitat for humans, avatars and robots, 2023

Xavi Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Ruslan Partsey, Jimmy Yang, Ruta Desai, Alexan- der William Clegg, Michal Hlavac, Tiffany Min, Theo Gervet, Vladimir V ondrus, Vincent-Pierre Berges, John Turner, Olek- sandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Ak- shara...

work page 2023

[45] [45]

Senthil Purushwalkam, S. V . A. Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Kumar Gupta, and Kristen Grauman. Audio-visual floorplan reconstruction. 2021 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 1163–1172, 2020. 1, 2

work page 2021

[46] [46]

Habitat-matterport 3d dataset (HM3d): 1000 large- scale 3d environments for embodied AI

Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wi- jmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew West- bury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-matterport 3d dataset (HM3d): 1000 large- scale 3d environments for embodied AI. InThirty-fifth Con- ference on Neural Info...

work page 2021

[47] [47]

Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, and David F. Fouhey. Far: Flexible, accurate and robust 6dof relative camera pose estimation. InCVPR,

work page

[48] [48]

Incoher- ent frequency fusion for broadband steered response power algorithms in noisy environments.IEEE Signal Processing Letters, 21(5):581–585, 2014

Daniele Salvati, Carlo Drioli, and Gian Luca Foresti. Incoher- ent frequency fusion for broadband steered response power algorithms in noisy environments.IEEE Signal Processing Letters, 21(5):581–585, 2014. 4

work page 2014

[49] [49]

SuperGlue: Learning feature match- ing with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature match- ing with graph neural networks. InCVPR, 2020. 2, 6

work page 2020

[50] [50]

Habitat: A Platform for Embodied AI Research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 2 10

work page 2019

[51] [51]

R. Schmidt. Multiple emitter location and signal parameter estimation.IEEE Transactions on Antennas and Propagation, 34(3):276–280, 1986. 4

work page 1986

[52] [52]

Seitz, and Richard Szeliski

Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. InSIGGRAPH Conference Proceedings, pages 835–846, New York, NY , USA, 2006. ACM Press. 2

work page 2006

[53] [53]

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kim- berly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Ba- tra, Hauke ...

work page internal anchor Pith review Pith/arXiv arXiv 1906

[54] [54]

LoFTR: Detector-free local feature matching with transformers.CVPR, 2021

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers.CVPR, 2021. 2, 6

work page 2021

[55] [55]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et- tinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in percepti...

work page 2020

[56] [56]

Habitat 2.0: Training home assistants to rearrange their habitat

Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir V ondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assist...

work page 2021

[57] [57]

Yfcc100m: The new data in multimedia research

Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li- Jia Li. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016. 2

work page 2016

[58] [58]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2

work page 2025

[59] [59]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024. 2

work page 2024

[60] [60]

Efficient LoFTR: Semi-dense local feature matching with sparse-like speed

Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. InCVPR, 2024. 6

work page 2024

[61] [61]

Anycam: Learning to re- cover camera poses and intrinsics from casual videos

Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, and Daniel Cremers. Anycam: Learning to re- cover camera poses and intrinsics from casual videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16717–16727, 2025. 2

work page 2025

[62] [62]

Binaural audio-visual localization

Xinyi Wu, Zhenyao Wu, Lili Ju, and Song Wang. Binaural audio-visual localization. InAAAI, pages 2961–2968, 2021. 2

work page 2021

[63] [63]

Youtube-vos: Sequence-to-sequence video object segmentation

Ning Xu, Linjie Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian Price, Scott Cohen, and Thomas Huang. Youtube-vos: Sequence-to-sequence video object segmentation. InProceedings of the European conference on computer vision (ECCV), pages 585–601, 2018. 2

work page 2018

[64] [64]

Telling left from right: Learning spatial correspondence of sight and sound

Karren Yang, Bryan Russell, and Justin Salamon. Telling left from right: Learning spatial correspondence of sight and sound. InCVPR, 2020. 4

work page 2020

[65] [65]

Camera pose estimation and localization with active audio sensing

Karren Yang, Michael Firman, Eric Brachmann, and Clement Godard. Camera pose estimation and localization with active audio sensing. InEuropean Conference on Computer Vision, pages 271–291. Springer, 2022. 1, 2

work page 2022

[66] [66]

Blendedmvs: A large-scale dataset for generalized multi-view stereo networks

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020. 2

work page 2020

[67] [67]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the International Conference on Computer Vision (ICCV), 2023. 2

work page 2023

[68] [68]

Geonet: Unsupervised learning of dense depth, optical flow and camera pose

Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

work page 2018

[69] [69]

MonST3r: A simple approach for estimating geometry in the presence of motion

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3r: A simple approach for estimating geometry in the presence of motion. InThe Thirteenth International Conference on Learning Representations, 2025. 2

work page 2025

[70] [70]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018. 2 11 Table 5. Mean of the ground-truth rotation and translation of the camera, expressed using angles, for different activity scenarios. Scenario Mean Rotation (◦) Mean Trans...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[71] [71]

Supplementary material 7.1. Implementation Details Here, we provide our model’s training hyperparameters: • Training epochs: 100 • Warmup epochs: 5 • Learning rate:10 −5 • Minimum learning rate:10 −7 • Batch size: 64 • Learning rate scheduler: Cosine annealing Here, we provide the steps and parameters for processing the visual inputs for Reloc3r [14], spe...

work page