Pith · machine review for the scientific record

arXiv: 2602.06846 · v3 · submitted 2026-02-06 · 💻 cs.SD

Recognition: 2 theorem links · Lean Theorem

DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 06:14 UTC · model grok-4.3

classification 💻 cs.SD
keywords first-order ambisonics · conditional diffusion · 3D Gaussian splatting · 360-degree video · spatial audio generation · dynamic scene reconstruction · acoustic modeling · immersive audio

The pith

DynFOA generates first-order ambisonics audio for 360-degree videos by conditioning a diffusion model on dynamic 3D scene reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DynFOA, a framework that takes 360-degree video as input and outputs first-order ambisonics (FOA) spatial audio. It first reconstructs the scene with 3D Gaussian Splatting to recover geometry, materials, and dynamic source positions, then feeds those features into a conditional diffusion model that synthesizes audio waveforms matching the visual scene. Existing methods rely only on visual cues and ignore acoustic effects such as occlusion and reverberation; DynFOA explicitly models these interactions through the reconstructed scene representation. The authors introduce M2G-360, a new dataset of 600 real-world clips with varying numbers of moving sources and varying geometric complexity, to test robustness under diverse conditions. Experiments demonstrate improvements in spatial accuracy, acoustic fidelity, and listener immersion compared with prior techniques.

Core claim

DynFOA synthesizes first-order ambisonics by first detecting dynamic sound sources and reconstructing scene geometry and materials via 3D Gaussian Splatting, then conditioning a diffusion model on the resulting physically grounded features to generate audio that respects source motion, occlusion, reflections, and reverberation from the listener's viewpoint.
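
For background on the output format: in the traditional B-format (FuMa-style) convention, a free-field mono source at azimuth theta and elevation phi maps onto the four FOA channels as W = s/sqrt(2), X = s·cos(theta)·cos(phi), Y = s·sin(theta)·cos(phi), Z = s·sin(phi). The sketch below shows only that standard panning relation with per-sample angles for a moving source; it is context on what FOA encodes, not the paper's renderer, which additionally has to account for occlusion, reflections, and reverberation.

  import numpy as np

  def encode_foa(mono, azimuth, elevation):
      # Encode a mono signal into 4-channel first-order ambisonics (B-format).
      # mono: (T,) waveform; azimuth, elevation: (T,) angles in radians,
      # given per sample so a moving source traces a trajectory.
      # Uses the traditional FuMa-style W scaling of 1/sqrt(2); SN3D/N3D
      # normalisations rescale the channels but keep the same structure.
      w = mono / np.sqrt(2.0)
      x = mono * np.cos(azimuth) * np.cos(elevation)
      y = mono * np.sin(azimuth) * np.cos(elevation)
      z = mono * np.sin(elevation)
      return np.stack([w, x, y, z])  # shape (4, T)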

What carries the argument

A conditional diffusion model whose noise prediction is guided by features extracted from 3D Gaussian Splatting reconstructions of scene geometry, materials, and dynamic source locations.
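
As a concrete illustration of this kind of conditioning, the sketch below attaches per-frame scene features to a toy noise-prediction network through cross-attention and trains it with a standard DDPM-style objective. The paper's exact conditioning mechanism is not specified in the text available here (see the referee's first minor comment), so the cross-attention choice, module names, and dimensions are assumptions rather than the authors' architecture; scene_feats simply stands in for whatever embedding the 3DGS reconstruction stage provides.

  import torch
  import torch.nn as nn

  class SceneConditionedDenoiser(nn.Module):
      # Toy epsilon-predictor: audio tokens attend to scene-feature tokens.
      def __init__(self, audio_dim=64, cond_dim=128, hidden=256):
          super().__init__()
          self.in_proj = nn.Linear(audio_dim, hidden)
          self.cond_proj = nn.Linear(cond_dim, hidden)
          self.t_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
          self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
          self.out_proj = nn.Linear(hidden, audio_dim)

      def forward(self, x_t, t, scene_feats):
          # x_t: (B, L, audio_dim) noisy audio latent; t: (B,) diffusion step;
          # scene_feats: (B, F, cond_dim) per-frame geometry/material/source tokens.
          h = self.in_proj(x_t) + self.t_embed(t.float()[:, None])[:, None, :]
          c = self.cond_proj(scene_feats)
          h, _ = self.attn(h, c, c)      # audio queries attend to scene tokens
          return self.out_proj(h)        # predicted noise

  def diffusion_loss(model, x0, scene_feats, alphas_cumprod):
      # Standard DDPM training step: corrupt x0, then regress the added noise.
      b = x0.shape[0]
      t = torch.randint(0, alphas_cumprod.shape[0], (b,))
      a = alphas_cumprod[t].view(b, 1, 1)
      noise = torch.randn_like(x0)
      x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
      return nn.functional.mse_loss(model(x_t, t, scene_feats), noise)

At inference the same features would guide iterative denoising from Gaussian noise, so source motion and viewpoint enter the output only through the conditioning sequence.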

If this is right

  • Dynamic source motion and listener viewpoint changes are reflected in the generated ambisonics without separate tracking modules.
  • Acoustic effects such as occlusion by scene geometry and material-dependent reflections emerge directly from the diffusion conditioning rather than hand-crafted rules.
  • The same reconstruction pipeline can be reused for other spatial audio formats once the diffusion head is retrained.
  • Performance holds across single-source, multi-source, and geometrically complex scenes in the M2G-360 evaluation splits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could extend to higher-order ambisonics by replacing the FOA decoder with a higher-channel output head while keeping the same scene features.
  • Real-time deployment would require faster 3D Gaussian Splatting updates or pre-computed scene priors for live 360 streams.
  • The method opens a path to audio synthesis for arbitrary camera paths in pre-scanned environments without new recordings.

Load-bearing premise

The 3D Gaussian Splatting reconstruction supplies features that accurately encode the acoustic interactions between sound sources, surfaces, and the moving listener viewpoint.

What would settle it

A side-by-side listening test on M2G-360 clips in which participants rate DynFOA outputs no higher than a visual-cue-only baseline for perceived spatial accuracy or immersion would falsify the central claim.
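
One way such a test could be scored, assuming paired per-clip ratings (for example MUSHRA-style scores for the same M2G-360 clip under both renderers), is a one-sided Wilcoxon signed-rank test; the numbers below are placeholders, not results from the paper.

  import numpy as np
  from scipy.stats import wilcoxon

  # Hypothetical per-clip mean ratings (0-100) for the same clips under the
  # two renderers; placeholder values for illustration only.
  dynfoa_scores   = np.array([72, 68, 80, 75, 70, 66, 78, 74])
  baseline_scores = np.array([65, 66, 71, 70, 69, 60, 72, 68])

  # One-sided paired test: are DynFOA outputs rated higher than the baseline?
  stat, p = wilcoxon(dynfoa_scores, baseline_scores, alternative="greater")
  print(f"W={stat}, p={p:.4f}")
  # Ratings no higher than the baseline would show up here as a large p-value,
  # matching the falsification condition described above.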

Figures

Figures reproduced from arXiv: 2602.06846 by Lin Chen, Qiang Qu, Xiaoming Chen, Yiran Shen, Ziyu Luo.

Figure 1: The Pipeline of DynFOA for Immersive Spatial Audio Generation.
Figure 2: Architecture Overview of the Proposed DynFOA Backbone.
Figure 3: Visualization comparison of Mel-spectrograms for the FOA channels.
Original abstract

Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DynFOA, a generative framework that reconstructs dynamic 360-degree scenes from video using 3D Gaussian Splatting to extract geometry, depth, semantics, and material features, then conditions a diffusion model on these features to synthesize first-order ambisonics (FOA) audio. It introduces the M2G-360 dataset (600 clips across MoveSources, Multi-Source, and Geometry subsets) and claims consistent outperformance over prior methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersion by modeling dynamic sources and effects such as occlusion and reverberation.

Significance. If the results hold, the integration of 3DGS-derived scene features with conditional diffusion would constitute a useful step toward physically motivated spatial audio generation for complex, dynamic 360-degree content, with potential impact on VR/AR and immersive media pipelines. The new M2G-360 dataset would also provide a reusable benchmark for audio-visual scene understanding.

major comments (2)
  1. [Abstract] The central claim of consistent outperformance in spatial accuracy, acoustic fidelity, and distribution matching is stated without any quantitative metrics, baseline names, error bars, statistical tests, or dataset statistics, which makes the experimental support unverifiable from the provided summary.
  2. [Method] Scene reconstruction description: the assertion that 3DGS supplies physically grounded features sufficient to capture acoustic interactions (reflections, occlusion, reverberation) is load-bearing for the conditioning step, yet no material estimation loss, absorption/scattering parameterization, or validation against measured room impulse responses is supplied; standard photometric 3DGS optimization does not guarantee acoustic coefficients.
minor comments (2)
  1. Clarify the exact conditioning mechanism (feature concatenation, cross-attention, etc.) and the diffusion schedule in the main text rather than deferring entirely to supplementary material.
  2. Add explicit dataset statistics (duration, source counts, acoustic complexity labels) and a table comparing all baselines on the three M2G-360 subsets.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the scene reconstruction claims. We address each major comment below and will revise the manuscript to improve clarity and verifiability while preserving the core technical contributions.

Point-by-point responses
  1. Referee: [Abstract] The central claim of consistent outperformance in spatial accuracy, acoustic fidelity, and distribution matching is stated without any quantitative metrics, baseline names, error bars, statistical tests, or dataset statistics, which makes the experimental support unverifiable from the provided summary.

    Authors: We agree that the abstract should be more self-contained. In the revised version we will incorporate specific quantitative results drawn from the experiments section, including baseline names (e.g., AV-Ambisonics, Visual2FOA), key metrics such as angular error reduction and Fréchet Audio Distance, and brief dataset statistics for the M2G-360 corpus. This will allow readers to assess the claims directly from the abstract without needing to consult the full paper. revision: yes

  2. Referee: [Method] Scene reconstruction description: the assertion that 3DGS supplies physically grounded features sufficient to capture acoustic interactions (reflections, occlusion, reverberation) is load-bearing for the conditioning step, yet no material estimation loss, absorption/scattering parameterization, or validation against measured room impulse responses is supplied; standard photometric 3DGS optimization does not guarantee acoustic coefficients.

    Authors: We acknowledge that standard 3DGS is optimized for photometric reconstruction and does not explicitly optimize acoustic coefficients. In DynFOA, geometry and depth are taken directly from the optimized 3DGS, while material properties are derived from semantic labels via a category-to-absorption mapping table constructed from literature values. We will expand the method section to describe this mapping explicitly, add the corresponding loss formulation if any auxiliary supervision is used, and include a limitations paragraph noting the absence of direct RIR validation on the current dataset. Future work will explore joint audio-visual optimization once suitable paired measurements become available. revision: partial
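
To make the second response concrete, a category-to-absorption lookup of the kind the simulated rebuttal describes could look like the sketch below; the semantic labels and band coefficients are illustrative placeholders in the range of typical published absorption tables, not the paper's actual mapping.

  # Illustrative mapping from semantic category to absorption coefficients
  # at (500 Hz, 1 kHz, 2 kHz); values are placeholders, not the paper's table.
  ABSORPTION_BY_CATEGORY = {
      "concrete": (0.02, 0.02, 0.03),
      "glass":    (0.04, 0.03, 0.02),
      "wood":     (0.10, 0.08, 0.08),
      "carpet":   (0.30, 0.45, 0.55),
      "curtain":  (0.35, 0.45, 0.50),
  }
  DEFAULT_ABSORPTION = (0.10, 0.10, 0.10)

  def gaussian_materials(semantic_labels):
      # Map per-Gaussian semantic labels to banded absorption coefficients,
      # falling back to a neutral default for unlisted categories.
      return [ABSORPTION_BY_CATEGORY.get(label, DEFAULT_ABSORPTION)
              for label in semantic_labels]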

Circularity Check

0 steps flagged

No circularity: data-driven pipeline with no self-referential derivations or fitted predictions by construction.

full rationale

The paper describes an end-to-end generative pipeline: 3DGS reconstruction of geometry/materials from video, followed by conditioning a diffusion model on the resulting features to synthesize FOA. No equations, uniqueness theorems, or parameter-fitting steps are shown that reduce the output (FOA) to the input by definition or self-citation. The central claims rest on empirical outperformance on the introduced M2G-360 dataset rather than any closed-form derivation. Self-citations are absent from the provided text, and 3DGS is referenced as an external technique. The approach is therefore self-contained as a trained conditional model without load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified assumption that 3D Gaussian Splatting yields acoustic-relevant geometry and materials, plus standard diffusion-model training assumptions; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: 3D Gaussian Splatting reconstruction supplies physically grounded features sufficient to model occlusion, reflection, and reverberation for audio generation
    Invoked to justify conditioning the diffusion model on scene geometry and materials.

pith-pipeline@v0.9.0 · 5575 in / 1344 out tokens · 37304 ms · 2026-05-16T06:14:57.871346+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 1 internal anchor

  1. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  2. V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano. The CIPIC HRTF database. In Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575), pp. 99–102. IEEE, 2001.
  3. L. Antani, A. Chandak, L. Savioja, and D. Manocha. Interactive sound propagation using compact acoustic transfer operators. ACM Transactions on Graphics (TOG), 31(1):1–12, 2012.
  4. D. R. Begault and L. J. Trejo. 3-D sound for virtual reality and multimedia. Technical report, 2000.
  5. S. Bhosale, H. Yang, D. Kanojia, J. Deng, and X. Zhu. AV-GS: Learning material and geometry aware priors for novel view acoustic synthesis. Advances in Neural Information Processing Systems, 37:28920–28937, 2024.
  6. Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi. SoundStorm: Efficient parallel audio generation.
  7. D. S. Brungart. Near-field virtual audio displays. Presence, 11(1):93–106, 2002.
  8. S. Cerdá, A. Giménez, J. Romero, R. Cibrian, and J. Miralles. Room acoustical parameters: A factor analysis approach. Applied Acoustics, 70(1):97–109, 2009.
  9. C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman. SoundSpaces: Audio-visual navigation in 3D environments. In European Conference on Computer Vision, pp. 17–36. Springer, 2020.
  10. C. Chen, C. Schissler, S. Garg, P. Kobernik, A. Clegg, P. Calamia, D. Batra, P. Robinson, and K. Grauman. SoundSpaces 2.0: A simulation platform for visual-acoustic learning. Advances in Neural Information Processing Systems, 35:8896–8911, 2022.
  11. H. Chen, W. Xie, A. Vedaldi, and A. Zisserman. VGGSound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 721–725. IEEE, 2020.
  12. N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan. WaveGrad: Estimating gradients for waveform generation. In International Conference on Learning Representations.
  13. Z. Chen, G. Gokeda, and Y. Yu. Introduction to Direction-of-Arrival Estimation. Artech House, 2010.
  14. H. K. Cheng, M. Ishii, A. Hayakawa, T. Shibuya, A. Schwing, and Y. Mitsufuji. MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 28901–28911, 2025.
  15. M. Cokelek, H. Ozsoy, N. Imamoglu, C. Ozcinar, I. Ayhan, E. Erdem, and A. Erdem. Spherical vision transformers for audio-visual saliency prediction in 360-degree videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  16. J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36:47704–47720, 2023.
  17. L. Courtney and R. Sreenivas. Using deep convolutional LSTM networks for learning spatiotemporal features. In Asian Conference on Pattern Recognition, pp. 307–320. Springer, 2019.
  18. B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 303–312.
  19. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi. High fidelity neural audio compression. Transactions on Machine Learning Research.
  20. A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4):1–11, 2018.
  21. W. G. Gardner and K. D. Martin. HRTF measurements of a KEMAR. The Journal of the Acoustical Society of America, 97(6):3907–3908, 1995.
  22. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE, 2017.
  23. A. Gupta and T. D. Abhayapala. Three-dimensional sound field reproduction using multiple circular loudspeaker arrays. IEEE Transactions on Audio, Speech, and Language Processing, 19(5):1149–1159, 2010.
  24. M. Heydari, M. Souden, B. Conejo, and J. Atkins. ImmerseDiffusion: A generative spatial audio latent diffusion model. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
  25. S. D. Jepsen, M. G. Christensen, and J. R. Jensen. A study of the scale invariant signal to distortion ratio in speech separation with noisy references. arXiv preprint arXiv:2508.14623, 2025.
  26. B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakis, et al. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
  27. J. Kim, H. Yun, and G. Kim. ViSAGe: Video-to-spatial audio generation. In 13th International Conference on Learning Representations, ICLR 2025, pp. 14239–14259. ICLR, 2025.
  28. Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations.
  29. F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi. AudioGen: Textually guided audio generation. In The Eleventh International Conference on Learning Representations.
  30. S. S. Kushwaha, J. Ma, M. R. Thomas, Y. Tian, and A. Bruni. Diff-SAGE: End-to-end spatial audio generation using diffusion models. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2025.
  31. B. Lin, J. Zheng, C. Xue, L. Fu, Y. Li, and Q. Shen. Motion-aware correlation filter-based object tracking in satellite videos. IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024.
  32. H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. In International Conference on Machine Learning, pp. 21450–21474. PMLR, 2023.
  33. H. Liu, T. Luo, K. Luo, Q. Jiang, P. Sun, J. Wang, R. Huang, Q. Chen, W. Wang, X. Li, et al. OmniAudio: Generating spatial audio from 360-degree video. In Forty-second International Conference on Machine Learning.
  34. H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:2871–2883, 2024.
  35. F. Lluís, V. Chatziioannou, and A. Hofmann. Points2Sound: From mono to binaural audio using 3D point cloud scenes. EURASIP Journal on Audio, Speech, and Music Processing, 2022(1):33, 2022.
  36. H. Loellmann, A. Brendel, P. Vary, and W. Kellermann. Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063, 2015.
  37. P. C. Loizou. Speech Enhancement: Theory and Practice. CRC Press.
  38. I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations.
  39. I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  40. C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 22(4):730–751, 2025.
  41. P. Majdak, Y. Iwaya, T. Carpentier, R. Nicol, M. Parmentier, A. Roginska, Y. Suzuki, K. Watanabe, H. Wierstorf, H. Ziegelwanger, et al. Spatially oriented format for acoustics: A data exchange format representing head-related transfer functions. In Audio Engineering Society Convention 134. Audio Engineering Society, 2013.
  42. P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. Mixed precision training. In International Conference on Learning Representations, 2018.
  43. P. Morgado, Y. Li, and N. Vasconcelos. Learning representations from audio-visual spatial alignment. Advances in Neural Information Processing Systems, 33:4733–4744, 2020.
  44. P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang. Self-supervised generation of spatial audio for 360 video. Advances in Neural Information Processing Systems, 31, 2018.
  45. Z. Pan, R. Tao, C. Xu, and H. Li. Selective listening by synchronizing speech with lips. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:1650–1664, 2022.
  46. R. Panda. Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis.
  47. K. K. Parida, S. Srivastava, and G. Sharma. Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3347–3356, 2022.
  48. B. Rafaely. Fundamentals of Spherical Array Processing, vol. 8. Springer, 2015.
  49. N. Raghuvanshi, J. Snyder, R. Mehra, M. Lin, and N. Govindaraju. Precomputed wave simulation for real-time sound propagation of dynamic sources in complex scenes. In ACM SIGGRAPH 2010 Papers, pp. 1–11, 2010.
  50. A. Ratnarajah and D. Manocha. Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 254–264. IEEE, 2024.
  51. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
  52. T. Samavati and M. Soryani. Deep learning-based 3D reconstruction: A survey. Artificial Intelligence Review, 56(9):9175–9219, 2023.
  53. C. Schissler and D. Manocha. Interactive sound propagation and rendering for large multi-source scenes. ACM Transactions on Graphics (TOG), 36(4):1, 2016.
  54. A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4358–4366, 2018.
  55. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
  56. Z. Tang, H.-Y. Meng, and D. Manocha. Learning acoustic scattering fields for dynamic interactive sound propagation. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR), pp. 835–844. IEEE, 2021.
  57. C. Van der Kelen, P. Göransson, B. Pluymers, and W. Desmet. On the influence of frequency-dependent elastic properties in vibro-acoustic modelling of porous materials under structural excitation. Journal of Sound and Vibration, 333(24):6560–6571, 2014.
  58. W. Wang, M. Feiszli, H. Wang, J. Malik, and D. Tran. Open-world instance segmentation: Exploiting pseudo ground truth from learned pairwise affinity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4422–4432, 2022.
  59. S. Xie, H. Zhu, T. He, X. Li, and Z. Chen. Sonic4D: Spatial audio generation for immersive 4D scene exploration. arXiv preprint arXiv:2506.15759, 2025.
  60. X. Xu, H. Zhou, Z. Liu, B. Dai, X. Wang, and D. Lin. Visually informed binaural audio generation without binaural audios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15485–15494, 2021.
  61. R. Yamamoto, E. Song, and J.-M. Kim. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. IEEE, 2020.
  62. M. Zhang, Q. Chen, T. Wu, Z. Liu, and D. Lin. VisAudio: End-to-end video-driven binaural spatial audio generation. arXiv preprint arXiv:2512.03036, 2025.
  63. X. Zhang, H. Sun, S. Wang, and J. Xu. A new regional localization method for indoor sound source based on convolutional neural networks. IEEE Access, 6:72073–72082, 2018.
  64. F. Zotter and M. Frank. Ambisonics: A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement, and Virtual Reality. Springer, 2019.