Pith · machine review for the scientific record

arXiv: 2602.06846 · v3 · submitted 2026-02-06 · 💻 cs.SD

Recognition: 2 theorem links · Lean Theorem

DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 06:14 UTC · model grok-4.3

classification 💻 cs.SD
keywords first-order ambisonics · conditional diffusion · 3D Gaussian splatting · 360-degree video · spatial audio generation · dynamic scene reconstruction · acoustic modeling · immersive audio

The pith

DynFOA generates first-order ambisonics audio for 360-degree videos by conditioning a diffusion model on dynamic 3D scene reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DynFOA, a framework that takes 360-degree video as input and outputs first-order ambisonics (FOA) spatial audio. It first reconstructs the scene with 3D Gaussian Splatting to recover geometry, materials, and dynamic source positions, then feeds those features into a conditional diffusion model that synthesizes audio waveforms matching the visual scene. Existing methods rely only on visual cues and ignore acoustic effects such as occlusion and reverberation; DynFOA explicitly models these interactions through the reconstructed scene representation. The authors introduce M2G-360, a new dataset of 600 real-world clips with varying numbers of moving sources and varying geometric complexity, to test robustness under diverse conditions. Experiments demonstrate improvements in spatial accuracy, acoustic fidelity, and listener immersion compared with prior techniques.

Core claim

DynFOA synthesizes first-order ambisonics by first detecting dynamic sound sources and reconstructing scene geometry and materials via 3D Gaussian Splatting, then conditioning a diffusion model on the resulting physically grounded features to generate audio that respects source motion, occlusion, reflections, and reverberation from the listener's viewpoint.
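
For background on the output format: in the traditional B-format (FuMa-style) convention, a free-field mono source at azimuth theta and elevation phi maps onto the four FOA channels as W = s/sqrt(2), X = s·cos(theta)·cos(phi), Y = s·sin(theta)·cos(phi), Z = s·sin(phi). The sketch below shows only that standard panning relation with per-sample angles for a moving source; it is context on what FOA encodes, not the paper's renderer, which additionally has to account for occlusion, reflections, and reverberation.

  import numpy as np

  def encode_foa(mono, azimuth, elevation):
      # Encode a mono signal into 4-channel first-order ambisonics (B-format).
      # mono: (T,) waveform; azimuth, elevation: (T,) angles in radians,
      # given per sample so a moving source traces a trajectory.
      # Uses the traditional FuMa-style W scaling of 1/sqrt(2); SN3D/N3D
      # normalisations rescale the channels but keep the same structure.
      w = mono / np.sqrt(2.0)
      x = mono * np.cos(azimuth) * np.cos(elevation)
      y = mono * np.sin(azimuth) * np.cos(elevation)
      z = mono * np.sin(elevation)
      return np.stack([w, x, y, z])  # shape (4, T)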

What carries the argument

A conditional diffusion model whose noise prediction is guided by features extracted from 3D Gaussian Splatting reconstructions of scene geometry, materials, and dynamic source locations.
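
As a concrete illustration of this kind of conditioning, the sketch below attaches per-frame scene features to a toy noise-prediction network through cross-attention and trains it with a standard DDPM-style objective. The paper's exact conditioning mechanism is not specified in the text available here (see the referee's first minor comment), so the cross-attention choice, module names, and dimensions are assumptions rather than the authors' architecture; scene_feats simply stands in for whatever embedding the 3DGS reconstruction stage provides.

  import torch
  import torch.nn as nn

  class SceneConditionedDenoiser(nn.Module):
      # Toy epsilon-predictor: audio tokens attend to scene-feature tokens.
      def __init__(self, audio_dim=64, cond_dim=128, hidden=256):
          super().__init__()
          self.in_proj = nn.Linear(audio_dim, hidden)
          self.cond_proj = nn.Linear(cond_dim, hidden)
          self.t_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
          self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
          self.out_proj = nn.Linear(hidden, audio_dim)

      def forward(self, x_t, t, scene_feats):
          # x_t: (B, L, audio_dim) noisy audio latent; t: (B,) diffusion step;
          # scene_feats: (B, F, cond_dim) per-frame geometry/material/source tokens.
          h = self.in_proj(x_t) + self.t_embed(t.float()[:, None])[:, None, :]
          c = self.cond_proj(scene_feats)
          h, _ = self.attn(h, c, c)      # audio queries attend to scene tokens
          return self.out_proj(h)        # predicted noise

  def diffusion_loss(model, x0, scene_feats, alphas_cumprod):
      # Standard DDPM training step: corrupt x0, then regress the added noise.
      b = x0.shape[0]
      t = torch.randint(0, alphas_cumprod.shape[0], (b,))
      a = alphas_cumprod[t].view(b, 1, 1)
      noise = torch.randn_like(x0)
      x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
      return nn.functional.mse_loss(model(x_t, t, scene_feats), noise)

At inference the same features would guide iterative denoising from Gaussian noise, so source motion and viewpoint enter the output only through the conditioning sequence.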

If this is right

  • Dynamic source motion and listener viewpoint changes are reflected in the generated ambisonics without separate tracking modules.
  • Acoustic effects such as occlusion by scene geometry and material-dependent reflections emerge directly from the diffusion conditioning rather than hand-crafted rules.
  • The same reconstruction pipeline can be reused for other spatial audio formats once the diffusion head is retrained.
  • Performance holds across single-source, multi-source, and geometrically complex scenes in the M2G-360 evaluation splits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could extend to higher-order ambisonics by replacing the FOA decoder with a higher-channel output head while keeping the same scene features.
  • Real-time deployment would require faster 3D Gaussian Splatting updates or pre-computed scene priors for live 360 streams.
  • The method opens a path to audio synthesis for arbitrary camera paths in pre-scanned environments without new recordings.

Load-bearing premise

The 3D Gaussian Splatting reconstruction supplies features that accurately encode the acoustic interactions between sound sources, surfaces, and the moving listener viewpoint.

What would settle it

A side-by-side listening test on M2G-360 clips in which participants rate DynFOA outputs no higher than a visual-cue-only baseline for perceived spatial accuracy or immersion would falsify the central claim.
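
One way such a test could be scored, assuming paired per-clip ratings (for example MUSHRA-style scores for the same M2G-360 clip under both renderers), is a one-sided Wilcoxon signed-rank test; the numbers below are placeholders, not results from the paper.

  import numpy as np
  from scipy.stats import wilcoxon

  # Hypothetical per-clip mean ratings (0-100) for the same clips under the
  # two renderers; placeholder values for illustration only.
  dynfoa_scores   = np.array([72, 68, 80, 75, 70, 66, 78, 74])
  baseline_scores = np.array([65, 66, 71, 70, 69, 60, 72, 68])

  # One-sided paired test: are DynFOA outputs rated higher than the baseline?
  stat, p = wilcoxon(dynfoa_scores, baseline_scores, alternative="greater")
  print(f"W={stat}, p={p:.4f}")
  # Ratings no higher than the baseline would show up here as a large p-value,
  # matching the falsification condition described above.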

Figures

Figures reproduced from arXiv: 2602.06846 by Lin Chen, Qiang Qu, Xiaoming Chen, Yiran Shen, Ziyu Luo.

Figure 1: The Pipeline of DynFOA for Immersive Spatial Audio Generation.
Figure 2: Architecture Overview of the Proposed DynFOA Backbone.
Figure 3: Visualization comparison of Mel-spectrograms for the FOA channels.
Original abstract

Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DynFOA, a generative framework that reconstructs dynamic 360-degree scenes from video using 3D Gaussian Splatting to extract geometry, depth, semantics, and material features, then conditions a diffusion model on these features to synthesize first-order ambisonics (FOA) audio. It introduces the M2G-360 dataset (600 clips across MoveSources, Multi-Source, and Geometry subsets) and claims consistent outperformance over prior methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersion by modeling dynamic sources and effects such as occlusion and reverberation.

Significance. If the results hold, the integration of 3DGS-derived scene features with conditional diffusion would constitute a useful step toward physically motivated spatial audio generation for complex, dynamic 360-degree content, with potential impact on VR/AR and immersive media pipelines. The new M2G-360 dataset would also provide a reusable benchmark for audio-visual scene understanding.

major comments (2)
  1. [Abstract] The central claim of consistent outperformance in spatial accuracy, acoustic fidelity, and distribution matching is stated without any quantitative metrics, baseline names, error bars, statistical tests, or dataset statistics, which makes the experimental support unverifiable from the provided summary.
  2. [Method] Scene reconstruction description: the assertion that 3DGS supplies physically grounded features sufficient to capture acoustic interactions (reflections, occlusion, reverberation) is load-bearing for the conditioning step, yet no material estimation loss, absorption/scattering parameterization, or validation against measured room impulse responses is supplied; standard photometric 3DGS optimization does not guarantee acoustic coefficients.
minor comments (2)
  1. Clarify the exact conditioning mechanism (feature concatenation, cross-attention, etc.) and the diffusion schedule in the main text rather than deferring entirely to supplementary material.
  2. Add explicit dataset statistics (duration, source counts, acoustic complexity labels) and a table comparing all baselines on the three M2G-360 subsets.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the scene reconstruction claims. We address each major comment below and will revise the manuscript to improve clarity and verifiability while preserving the core technical contributions.

Point-by-point responses
  1. Referee: [Abstract] The central claim of consistent outperformance in spatial accuracy, acoustic fidelity, and distribution matching is stated without any quantitative metrics, baseline names, error bars, statistical tests, or dataset statistics, which makes the experimental support unverifiable from the provided summary.

    Authors: We agree that the abstract should be more self-contained. In the revised version we will incorporate specific quantitative results drawn from the experiments section, including baseline names (e.g., AV-Ambisonics, Visual2FOA), key metrics such as angular error reduction and Fréchet Audio Distance, and brief dataset statistics for the M2G-360 corpus. This will allow readers to assess the claims directly from the abstract without needing to consult the full paper. revision: yes

  2. Referee: [Method] Scene reconstruction description: the assertion that 3DGS supplies physically grounded features sufficient to capture acoustic interactions (reflections, occlusion, reverberation) is load-bearing for the conditioning step, yet no material estimation loss, absorption/scattering parameterization, or validation against measured room impulse responses is supplied; standard photometric 3DGS optimization does not guarantee acoustic coefficients.

    Authors: We acknowledge that standard 3DGS is optimized for photometric reconstruction and does not explicitly optimize acoustic coefficients. In DynFOA, geometry and depth are taken directly from the optimized 3DGS, while material properties are derived from semantic labels via a category-to-absorption mapping table constructed from literature values. We will expand the method section to describe this mapping explicitly, add the corresponding loss formulation if any auxiliary supervision is used, and include a limitations paragraph noting the absence of direct RIR validation on the current dataset. Future work will explore joint audio-visual optimization once suitable paired measurements become available. revision: partial
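
To make the second response concrete, a category-to-absorption lookup of the kind the simulated rebuttal describes could look like the sketch below; the semantic labels and band coefficients are illustrative placeholders in the range of typical published absorption tables, not the paper's actual mapping.

  # Illustrative mapping from semantic category to absorption coefficients
  # at (500 Hz, 1 kHz, 2 kHz); values are placeholders, not the paper's table.
  ABSORPTION_BY_CATEGORY = {
      "concrete": (0.02, 0.02, 0.03),
      "glass":    (0.04, 0.03, 0.02),
      "wood":     (0.10, 0.08, 0.08),
      "carpet":   (0.30, 0.45, 0.55),
      "curtain":  (0.35, 0.45, 0.50),
  }
  DEFAULT_ABSORPTION = (0.10, 0.10, 0.10)

  def gaussian_materials(semantic_labels):
      # Map per-Gaussian semantic labels to banded absorption coefficients,
      # falling back to a neutral default for unlisted categories.
      return [ABSORPTION_BY_CATEGORY.get(label, DEFAULT_ABSORPTION)
              for label in semantic_labels]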

Circularity Check

0 steps flagged

No circularity: data-driven pipeline with no self-referential derivations or fitted predictions by construction.

full rationale

The paper describes an end-to-end generative pipeline: 3DGS reconstruction of geometry/materials from video, followed by conditioning a diffusion model on the resulting features to synthesize FOA. No equations, uniqueness theorems, or parameter-fitting steps are shown that reduce the output (FOA) to the input by definition or self-citation. The central claims rest on empirical outperformance on the introduced M2G-360 dataset rather than any closed-form derivation. Self-citations are absent from the provided text, and 3DGS is referenced as an external technique. The approach is therefore self-contained as a trained conditional model without load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that 3D Gaussian Splatting yields acoustic-relevant geometry and materials, plus standard diffusion-model training assumptions; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: 3D Gaussian Splatting reconstruction supplies physically grounded features sufficient to model occlusion, reflection, and reverberation for audio generation
    Invoked to justify conditioning the diffusion model on scene geometry and materials.

pith-pipeline@v0.9.0 · 5575 in / 1344 out tokens · 37304 ms · 2026-05-16T06:14:57.871346+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 1 internal anchor

  1. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  2. V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano. The CIPIC HRTF database. In Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575), pp. 99–102. IEEE, 2001.
  3. L. Antani, A. Chandak, L. Savioja, and D. Manocha. Interactive sound propagation using compact acoustic transfer operators. ACM Transactions on Graphics (TOG), 31(1):1–12, 2012.
  4. D. R. Begault and L. J. Trejo. 3-D sound for virtual reality and multimedia. Technical report, 2000.
  5. S. Bhosale, H. Yang, D. Kanojia, J. Deng, and X. Zhu. AV-GS: Learning material and geometry aware priors for novel view acoustic synthesis. Advances in Neural Information Processing Systems, 37:28920–28937, 2024.
  6. Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi. SoundStorm: Efficient parallel audio generation.
  7. D. S. Brungart. Near-field virtual audio displays. Presence, 11(1):93–106, 2002.
  8. S. Cerdá, A. Giménez, J. Romero, R. Cibrian, and J. Miralles. Room acoustical parameters: A factor analysis approach. Applied Acoustics, 70(1):97–109, 2009.
  9. C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman. SoundSpaces: Audio-visual navigation in 3D environments. In European Conference on Computer Vision, pp. 17–36. Springer, 2020.
  10. C. Chen, C. Schissler, S. Garg, P. Kobernik, A. Clegg, P. Calamia, D. Batra, P. Robinson, and K. Grauman. SoundSpaces 2.0: A simulation platform for visual-acoustic learning. Advances in Neural Information Processing Systems, 35:8896–8911, 2022.
  11. H. Chen, W. Xie, A. Vedaldi, and A. Zisserman. VGGSound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721–725. IEEE, 2020.
  12. N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan. WaveGrad: Estimating gradients for waveform generation. In International Conference on Learning Representations.
  13. Z. Chen, G. Gokeda, and Y. Yu. Introduction to Direction-of-Arrival Estimation. Artech House, 2010.
  14. H. K. Cheng, M. Ishii, A. Hayakawa, T. Shibuya, A. Schwing, and Y. Mitsufuji. MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28901–28911, 2025.
  15. M. Cokelek, H. Ozsoy, N. Imamoglu, C. Ozcinar, I. Ayhan, E. Erdem, and A. Erdem. Spherical vision transformers for audio-visual saliency prediction in 360-degree videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  16. J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36:47704–47720, 2023.
  17. L. Courtney and R. Sreenivas. Using deep convolutional LSTM networks for learning spatiotemporal features. In Asian Conference on Pattern Recognition, pp. 307–320. Springer, 2019.
  18. B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303–312.
  19. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. Transactions on Machine Learning Research.
  20. A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4):1–11, 2018.
  21. W. G. Gardner and K. D. Martin. HRTF measurements of a KEMAR. The Journal of the Acoustical Society of America, 97(6):3907–3908, 1995.
  22. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE, 2017.
  23. A. Gupta and T. D. Abhayapala. Three-dimensional sound field reproduction using multiple circular loudspeaker arrays. IEEE Transactions on Audio, Speech, and Language Processing, 19(5):1149–1159, 2010.
  24. M. Heydari, M. Souden, B. Conejo, and J. Atkins. ImmerseDiffusion: A generative spatial audio latent diffusion model. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
  25. S. D. Jepsen, M. G. Christensen, and J. R. Jensen. A study of the scale invariant signal to distortion ratio in speech separation with noisy references. arXiv preprint arXiv:2508.14623, 2025.
  26. B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakis, et al. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
  27. J. Kim, H. Yun, and G. Kim. ViSAGe: Video-to-spatial audio generation. In 13th International Conference on Learning Representations, ICLR 2025, pp. 14239–14259. ICLR, 2025.
  28. Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations.
  29. F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi. AudioGen: Textually guided audio generation. In The Eleventh International Conference on Learning Representations.
  30. S. S. Kushwaha, J. Ma, M. R. Thomas, Y. Tian, and A. Bruni. Diff-SAGE: End-to-end spatial audio generation using diffusion models. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
  31. B. Lin, J. Zheng, C. Xue, L. Fu, Y. Li, and Q. Shen. Motion-aware correlation filter-based object tracking in satellite videos. IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024.
  32. H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. In International Conference on Machine Learning, pp. 21450–21474. PMLR, 2023.
  33. H. Liu, T. Luo, K. Luo, Q. Jiang, P. Sun, J. Wang, R. Huang, Q. Chen, W. Wang, X. Li, et al. OmniAudio: Generating spatial audio from 360-degree video. In Forty-second International Conference on Machine Learning.
  34. H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:2871–2883, 2024.
  35. F. Lluís, V. Chatziioannou, and A. Hofmann. Points2Sound: From mono to binaural audio using 3D point cloud scenes. EURASIP Journal on Audio, Speech, and Music Processing, 2022(1):33, 2022.
  36. H. Loellmann, A. Brendel, P. Vary, and W. Kellermann. Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063, 2015.
  37. P. C. Loizou. Speech Enhancement: Theory and Practice. CRC Press.
  38. I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations.
  39. I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  40. C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 22(4):730–751, 2025.
  41. P. Majdak, Y. Iwaya, T. Carpentier, R. Nicol, M. Parmentier, A. Roginska, Y. Suzuki, K. Watanabe, H. Wierstorf, H. Ziegelwanger, et al. Spatially oriented format for acoustics: A data exchange format representing head-related transfer functions. In Audio Engineering Society Convention 134. Audio Engineering Society, 2013.
  42. P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. Mixed precision training. In International Conference on Learning Representations, 2018.
  43. P. Morgado, Y. Li, and N. Vasconcelos. Learning representations from audio-visual spatial alignment. Advances in Neural Information Processing Systems, 33:4733–4744, 2020.
  44. P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang. Self-supervised generation of spatial audio for 360 video. Advances in Neural Information Processing Systems, 31, 2018.
  45. Z. Pan, R. Tao, C. Xu, and H. Li. Selective listening by synchronizing speech with lips. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:1650–1664, 2022.
  46. R. Panda. Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis.
  47. K. K. Parida, S. Srivastava, and G. Sharma. Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3347–3356, 2022.
  48. B. Rafaely. Fundamentals of Spherical Array Processing, vol. 8. Springer, 2015.
  49. N. Raghuvanshi, J. Snyder, R. Mehra, M. Lin, and N. Govindaraju. Precomputed wave simulation for real-time sound propagation of dynamic sources in complex scenes. In ACM SIGGRAPH 2010 Papers, pp. 1–11, 2010.
  50. A. Ratnarajah and D. Manocha. Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 254–264. IEEE, 2024.
  51. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
  52. T. Samavati and M. Soryani. Deep learning-based 3D reconstruction: A survey. Artificial Intelligence Review, 56(9):9175–9219, 2023.
  53. C. Schissler and D. Manocha. Interactive sound propagation and rendering for large multi-source scenes. ACM Transactions on Graphics (TOG), 36(4):1, 2016.
  54. A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4358–4366, 2018.
  55. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
  56. Z. Tang, H.-Y. Meng, and D. Manocha. Learning acoustic scattering fields for dynamic interactive sound propagation. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pp. 835–844. IEEE, 2021.
  57. C. Van der Kelen, P. Göransson, B. Pluymers, and W. Desmet. On the influence of frequency-dependent elastic properties in vibro-acoustic modelling of porous materials under structural excitation. Journal of Sound and Vibration, 333(24):6560–6571, 2014.
  58. W. Wang, M. Feiszli, H. Wang, J. Malik, and D. Tran. Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4422–4432, 2022.
  59. S. Xie, H. Zhu, T. He, X. Li, and Z. Chen. Sonic4D: Spatial audio generation for immersive 4D scene exploration. arXiv preprint arXiv:2506.15759, 2025.
  60. X. Xu, H. Zhou, Z. Liu, B. Dai, X. Wang, and D. Lin. Visually informed binaural audio generation without binaural audios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15485–15494, 2021.
  61. R. Yamamoto, E. Song, and J.-M. Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE, 2020.
  62. M. Zhang, Q. Chen, T. Wu, Z. Liu, and D. Lin. VisAudio: End-to-end video-driven binaural spatial audio generation. arXiv preprint arXiv:2512.03036, 2025.
  63. X. Zhang, H. Sun, S. Wang, and J. Xu. A new regional localization method for indoor sound source based on convolutional neural networks. IEEE Access, 6:72073–72082, 2018.
  64. F. Zotter and M. Frank. Ambisonics: A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement, and Virtual Reality. Springer, 2019.