Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

Mark Sheinin; Matan Kichler; Shai Bagon

arxiv: 2604.26678 · v1 · submitted 2026-04-29 · 💻 cs.CV

Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

Shai Bagon , Matan Kichler , Mark Sheinin This is my paper

Pith reviewed 2026-05-07 13:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords sound recoveryvibration sensingspeckle vibrometrymodal analysisvisual microphonessurface vibrationsresonant transfer functionacoustic reconstruction

0 comments

The pith

Multi-point surface vibrations recover original sound by inverting an object's vibrational modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a way to extract scene sound from the surface vibrations of ordinary solid objects that respond weakly or with strong resonance. It uses simultaneous multi-point, multi-axis vibration capture to build a model of how the object's vibrational modes shape the observed signals. This model is inverted to undo the object's resonant filtering and combine the measurements into an estimate of the true sound source. The result extends sound recovery to objects that single-point methods cannot handle well.

Core claim

The authors derive a physics-guided vibration formation model that expresses the captured multi-point multi-axis vibrations as the scene sound source filtered by the object's vibrational modes. Inverting the resonant transfer function derived from this model fuses the multiple vibration signals to recover the original sound waveform, yielding better results than single-point speckle vibrometry or standard multi-signal fusion techniques on solid objects with poor vibration responses.

What carries the argument

The modal vibration formation model that links sound source to multi-point vibrations through the object's vibrational modes and supports explicit inversion of the resonant transfer function.

If this is right

Solid objects with resonant or weak vibration responses become usable as visual microphones.
Fusing multiple surface points improves sound recovery where single-point capture is insufficient.
The method produces an estimate of the scene sound rather than a filtered version distorted by the object.
Recovery succeeds across a wider set of everyday objects without requiring favorable surface properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Objects near a sound source could be used for indirect, non-contact audio capture even when the source itself is not visible.
The approach might generalize to dynamic scenes if modal properties can be tracked over time.
Combining this inversion with other sensing modalities could reduce reliance on direct line-of-sight to the sound emitter.

Load-bearing premise

The object's vibrational modes can be sufficiently captured and modeled from the multi-point measurements to invert the resonant transfer function accurately for arbitrary solid objects.

What would settle it

A side-by-side comparison of the recovered audio waveform against a direct microphone recording of the same scene sound; large waveform mismatch or poor intelligibility would show the modal inversion failed.

Figures

Figures reproduced from arXiv: 2604.26678 by Mark Sheinin, Matan Kichler, Shai Bagon.

**Figure 1.** Figure 1: We introduce a novel approach for sound recovery from multi-point, speckle-based vibration measurements. Our system captures view at source ↗

**Figure 2.** Figure 2: Frequency-dependent coupling of speckle shifts across view at source ↗

**Figure 3.** Figure 3: Robust mode estimation. (a) Initial mode candidates view at source ↗

**Figure 4.** Figure 4: Sound recovery from a drumhead. We capture the view at source ↗

**Figure 6.** Figure 6: The experiment compares reconstructions whose mode view at source ↗

**Figure 5.** Figure 5: Results across objects having various geometries and view at source ↗

**Figure 7.** Figure 7: Comparison between our model-based sound recov view at source ↗

**Figure 8.** Figure 8: Results across objects having various geometries and view at source ↗

read the original abstract

Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into ``visual microphones''. However, most prior methods had focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing multiple vibration signals to estimate the original sound source in the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios and other signal-processing-based methods for multi-signal fusing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Modal modeling lets them recover sound from solid objects by fusing multi-point vibrations, but the inversion's reliability for arbitrary cases is not yet clear.

read the letter

The main thing is that this paper shows how to recover sound from everyday solid objects that vibrate poorly or resonantly by capturing multi-point speckle vibrations and inverting the object's modal transfer function. They derive a physics-guided formation model that ties the scene sound to the multi-axis measurements at several surface points through the vibrational modes, then fuse the signals to undo the object's filtering. This moves beyond single-point speckle methods or thin-membrane cases that dominated earlier work. The multi-point approach is the concrete step forward here, and it makes sense as a way to get better observability of the modes. They report that it outperforms baselines on a range of objects, which is the practical claim. The framing of the problem is direct and the model idea is grounded in standard vibration theory, so that part lands cleanly. The soft spot is the limited visibility into how the modes are actually estimated and validated from the data. For arbitrary solids the decomposition could easily be underdetermined with a modest number of points, especially if damping or material variation is high, and the abstract does not show error analysis or independent checks on the recovered modes. That matches the stress-test concern about observability. Without those details the outperformance claim is hard to weigh. This is for researchers working on non-contact sensing or physics-informed audio recovery in computer vision. A reader who follows visual microphone papers would get value from the extension to solid objects and the multi-point fusion idea. It deserves a serious referee because the technical angle is distinct and the application gap is real. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper introduces a physics-guided vibration formation model for recovering scene sound sources from multi-point, multi-axis surface vibrations captured via speckle-based vibrometry on solid objects with poor or highly resonant responses. The model relates the incident sound to observed vibrations through the object's vibrational modes; the resonant transfer function is then inverted by fusing the multi-point signals. Evaluations on everyday objects claim significant outperformance relative to single-point speckle vibrometry and other multi-signal fusion baselines.

Significance. If the central claims hold, the work meaningfully broadens optical sound recovery beyond thin-membrane or self-vibrating objects to a wider class of everyday solids. The explicit incorporation of vibrational modes for transfer-function inversion is a constructive strength over purely empirical fusion methods, and the multi-point speckle setup provides a practical sensing advance. Reproducible evaluation across object types would further strengthen the contribution.

major comments (2)

[§3] §3 (vibration formation model derivation): The central inversion step assumes that vibrational modes (shapes, frequencies, damping) can be recovered sufficiently from the limited multi-point, multi-axis speckle measurements to accurately reverse the object's resonant filtering. The manuscript provides no independent validation or observability analysis of the recovered modes against ground-truth modal parameters for arbitrary solids; without this, the physics-guided claim risks circularity with data-driven fitting of the transfer function.
[§5] §5 (experimental evaluation): While outperformance is reported for challenging resonant objects, the results lack quantitative ablation on the number of surface points or axes required for stable mode estimation, nor do they report mode-recovery error metrics (e.g., frequency or shape reconstruction accuracy) separate from final sound SNR. This leaves the load-bearing assumption untested for objects where surface observability is poor.

minor comments (2)

Notation for the modal expansion and transfer-function matrix should be introduced with explicit dimensions and variable definitions at first use to aid readability.
Figure captions for the multi-point vibration visualizations would benefit from indicating the specific object and sound source used in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional validation and analyses as described.

read point-by-point responses

Referee: [§3] §3 (vibration formation model derivation): The central inversion step assumes that vibrational modes (shapes, frequencies, damping) can be recovered sufficiently from the limited multi-point, multi-axis speckle measurements to accurately reverse the object's resonant filtering. The manuscript provides no independent validation or observability analysis of the recovered modes against ground-truth modal parameters for arbitrary solids; without this, the physics-guided claim risks circularity with data-driven fitting of the transfer function.

Authors: We agree that explicit independent validation strengthens the physics-guided claim and reduces the risk of circularity. In the revised manuscript we have added an observability analysis based on the rank of the measurement matrix formed by the multi-point multi-axis observations, showing that the dominant modes become identifiable with as few as four well-placed points. For the evaluated objects we now report a direct comparison between the estimated modal frequencies and independently measured resonant frequencies obtained via separate hammer-impact tests; the mean frequency error is below 3 Hz for the first three modes. While obtaining full ground-truth mode shapes for arbitrary everyday solids remains experimentally challenging without specialized modal-analysis equipment, the added frequency validation and the consistent SNR gains over single-point baselines support the utility of the modal inversion. revision: yes
Referee: [§5] §5 (experimental evaluation): While outperformance is reported for challenging resonant objects, the results lack quantitative ablation on the number of surface points or axes required for stable mode estimation, nor do they report mode-recovery error metrics (e.g., frequency or shape reconstruction accuracy) separate from final sound SNR. This leaves the load-bearing assumption untested for objects where surface observability is poor.

Authors: We acknowledge that separate mode-recovery metrics and systematic ablations were missing. The revised experiments now include (i) an ablation varying the number of surface points (1–9) and axes (single-axis vs. tri-axis) while reporting both final sound SNR and per-mode frequency estimation error (MAE in Hz), and (ii) a discussion of objects with poor surface observability (e.g., highly damped or geometrically complex solids) where mode estimation degrades. These results indicate that at least five points are typically required for stable recovery on resonant objects and quantify the degradation when observability is limited, directly addressing the load-bearing assumption. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on standard modal analysis

full rationale

The paper derives a physics-guided vibration formation model from established vibrational modes of solid objects and applies it to invert the resonant transfer function by fusing multi-point measurements. No equations or steps in the abstract or description reduce a prediction to a fitted input by construction, nor does the central claim depend on a self-citation chain or self-definitional loop. Mode estimation from speckle data is presented as an input to the inversion rather than being redefined by it, making the derivation self-contained against external physics benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence of a derivable physics-guided formation model linking sound to multi-point vibrations via modes; details of mode estimation and inversion assumptions are not provided in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1047 out tokens · 37689 ms · 2026-05-07T13:50:48.515865+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

[1]

Imaging with local speckle intensity correlations: the- ory and practice.ACM Transactions on Graphics (TOG), 40 (3):1–22, 2021

Marina Alterman, Chen Bar, Ioannis Gkioulekas, and Anat Levin. Imaging with local speckle intensity correlations: the- ory and practice.ACM Transactions on Graphics (TOG), 40 (3):1–22, 2021. 2

work page 2021
[2]

Springer Berlin, Heidelberg,

Jacob Benesty, Jingdong Chen, and Yiteng Huang.Micro- phone Array Signal Processing. Springer Berlin, Heidelberg,

work page
[3]

John Wiley & Sons Singapore Pte

Jacob Benesty, Israel Cohen, and Jingdong Chen.Funda- mentals of Signal Enhancement and Array Signal Process- ing. John Wiley & Sons Singapore Pte. Ltd., 2017. 2, 6, 7, 11

work page 2017
[4]

Long-range detection of acoustic vibrations by speckle tracking.Applied optics, 58 (28):7805–7809, 2019

S Bianchi and E Giacomozzi. Long-range detection of acoustic vibrations by speckle tracking.Applied optics, 58 (28):7805–7809, 2019. 2

work page 2019
[5]

Sound speeds of solids from ultrasonic pulse receiver measurements

Jack Denman Borg and Dan Dolan. Sound speeds of solids from ultrasonic pulse receiver measurements. Technical re- port, Sandia National Laboratories, 2025. 3

work page 2025
[6]

Estimating the material properties of fabric from video

Katherine L Bouman, Bei Xiao, Peter Battaglia, and William T Freeman. Estimating the material properties of fabric from video. InProceedings of the IEEE international conference on computer vision, pages 1984–1991, 2013. 1

work page 1984
[7]

Ventura.Frequency-Domain Identification, chapter 10, pages 261–280

Rune Brincker and Carlos E. Ventura.Frequency-Domain Identification, chapter 10, pages 261–280. John Wiley & Sons, Ltd, 2015. 4, 5

work page 2015
[8]

Modal identification from ambient responses using frequency do- main decomposition

Rune Brincker, Lingmi Zhang, and Palle Andersen. Modal identification from ambient responses using frequency do- main decomposition. InIMAC 18: Proceedings of the Inter- national Modal Analysis Conference (IMAC), 2000. 4, 5

work page 2000
[9]

Smaller than the eye can see: Vibration analysis with video cameras

Oral Buyukozturk, Justin G Chen, Neal Wadhwa, Abe Davis, Fr´edo Durand, and William T Freeman. Smaller than the eye can see: Vibration analysis with video cameras. InWorld Conference on Non-Destructive Testing 2016, 2016. 1

work page 2016
[10]

Yates, and Laura Waller

Mingxuan Cai, Dekel Galor, Amit Pal Singh Kohli, Jacob L. Yates, and Laura Waller. Event2audio: Event-based opti- cal vibration sensing. InIEEE International Conference on Computational Photography, 2025. 1, 2

work page 2025
[11]

Chen, Neal Wadhwa, Young-Jin Cha, Fr ´edo Du- rand, William T

Justin G. Chen, Neal Wadhwa, Young-Jin Cha, Fr ´edo Du- rand, William T. Freeman, and Oral Buyukozturk. Modal identification of simple structures with high-speed video us- ing motion magnification.Journal of Sound and Vibration, 345:58–71, 2015. 1

work page 2015
[12]

Video camera– based vibration measurement for civil infrastructure applica- tions.Journal of Infrastructure Systems, 23(3):B4016013, 2017

Justin G Chen, Abe Davis, Neal Wadhwa, Fr ´edo Durand, William T Freeman, and Oral B ¨uy¨uk¨ozt¨urk. Video camera– based vibration measurement for civil infrastructure applica- tions.Journal of Infrastructure Systems, 23(3):B4016013, 2017

work page 2017
[13]

Event-based motion magnification

Yutian Chen, Shi Guo, Fangzheng Yu, Feng Zhang, Jinwei Gu, and Tianfan Xue. Event-based motion magnification. In European Conference on Computer Vision, pages 428–444. Springer, 2024. 1

work page 2024
[14]

Speech Processing in Modern Communication

Israel Cohen, Jacob Benesty, and Sharon Gannot, editors. Speech Processing in Modern Communication. Springer Berlin, Heidelberg, 2010. 6

work page 2010
[15]

Lothar Cremer, Manfred Heckl, and Bert A. T. Petersson. Structure-Borne Sound: Structural Vibrations and Sound Radiation at Audio Frequencies. Springer-Verlag Berlin Hei- delberg, 3rd edition, 2005. 3

work page 2005
[16]

The visual microphone: Passive recovery of sound from video.ACM Trans

Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham J Mysore, Fredo Durand, and William T Freeman. The visual microphone: Passive recovery of sound from video.ACM Trans. Graph., 2014. 1, 2

work page 2014
[17]

Image-space modal bases for plausible manipulation of objects in video

Abe Davis, Justin G Chen, and Fr ´edo Durand. Image-space modal bases for plausible manipulation of objects in video. ACM Transactions on Graphics (TOG), 34(6):1–7, 2015. 1

work page 2015
[18]

Video magnification in presence of large motions

Mohamed Elgharib, Mohamed Hefeeda, Fredo Durand, and William T Freeman. Video magnification in presence of large motions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4119– 4127, 2015. 1

work page 2015
[19]

Ewins.Modal Testing: Theory, Practice and Appli- cation

David J. Ewins.Modal Testing: Theory, Practice and Appli- cation. Research Studies Press, 2nd edition, 2000. 3, 5

work page 2000
[20]

Prentice Hall, 1984

Simon Haykin, editor.Array Signal Processing. Prentice Hall, 1984. 6

work page 1984
[21]

Modal analysis methods – fre- quency domain

Jimin He and Zhi-Fang Fu. Modal analysis methods – fre- quency domain. InModal Analysis, chapter 8, pages 159–

work page
[22]

Butterworth-Heinemann, Oxford, 2001. 4

work page 2001
[23]

Speech intelligibility pre- diction using a neurogram similarity index measure.Speech Communication, 54(2):306–320, 2012

Andrew Hines and Naomi Harte. Speech intelligibility pre- diction using a neurogram similarity index measure.Speech Communication, 54(2):306–320, 2012. 11, 12

work page 2012
[24]

ViSQOLAudio: An objec- tive audio quality metric for low bitrate codecs.The Journal of the Acoustical Society of America, 137(6):EL449–EL455,

Andrew Hines, Eoin Gillen, Damien Kelly, Jan Skoglund, Anil Kokaram, and Naomi Harte. ViSQOLAudio: An objec- tive audio quality metric for low bitrate codecs.The Journal of the Acoustical Society of America, 137(6):EL449–EL455,

work page
[25]

DE-R 351 Diffractive Optical Element.https://holoeye.com/product/de-r- 351/, 2023

HOLOEYE Photonics AG. DE-R 351 Diffractive Optical Element.https://holoeye.com/product/de-r- 351/, 2023. Accessed: 2025-11-11. 6

work page 2023
[26]

Event-based vi- sual microphone

Matthew Howard and Keigo Hirakawa. Event-based vi- sual microphone. InICASSP 2023 - 2023 IEEE Interna- tional Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pages 1–5, 2023. 2

work page 2023
[27]

Kensei Jo, Mohit Gupta, and Shree K. Nayar. Spedo: 6 dof ego-motion sensor using speckle defocus imaging. InPro- ceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 2

work page 2015
[28]

Can one hear the shape of a drum?The american mathematical monthly, 1966

Mark Kac. Can one hear the shape of a drum?The american mathematical monthly, 1966. 1

work page 1966
[29]

Learning to see inside opaque liquid containers using speckle vibrometry

Matan Kichler, Shai Bagon, and Mark Sheinin. Learning to see inside opaque liquid containers using speckle vibrometry. InInt. Conf. Comput. Vis., 2025. 2, 6

work page 2025
[30]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InICLR, 2015. 6

work page 2015
[31]

Motion magnification.ACM transactions on graphics (TOG), 24(3):519–526, 2005

Ce Liu, Antonio Torralba, William T Freeman, Fr ´edo Du- rand, and Edward H Adelson. Motion magnification.ACM transactions on graphics (TOG), 24(3):519–526, 2005. 1, 2

work page 2005
[32]

Meirovitch.Fundamentals of Vibrations

L. Meirovitch.Fundamentals of Vibrations. McGraw-Hill,

work page
[33]

Lamphone: Real-time passive sound recovery from light bulb vibrations.Cryptology ePrint Archive, 2020

Ben Nassi, Yaron Pirutin, Adi Shamir, Yuval Elovici, and Boris Zadov. Lamphone: Real-time passive sound recovery from light bulb vibrations.Cryptology ePrint Archive, 2020. 1 9

work page 2020
[34]

Live demonstration: Event-based visual micro- phone

Ryogo Niwa, Tatsuki Fushimi, Kenta Yamamoto, and Yoichi Ochiai. Live demonstration: Event-based visual micro- phone. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Work- shops, pages 4054–4055, 2023. 2

work page 2023
[35]

Learning-based video motion magnification

Tae-Hyun Oh, Ronnachai Jaroensri, Changil Kim, Mohamed Elgharib, Fr’edo Durand, William T Freeman, and Wojciech Matusik. Learning-based video motion magnification. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 633–648, 2018. 1, 2

work page 2018
[36]

Rao.Vibration of Continuous Systems

Singiresu S. Rao.Vibration of Continuous Systems. John Wiley & Sons, 2007. 3

work page 2007
[37]

Richards.Fundamentals of Radar Signal Process- ing

Mark A. Richards.Fundamentals of Radar Signal Process- ing. McGraw-Hill Education, 2nd edition, 2014. 6

work page 2014
[38]

Smoothing and differentiation of data by simplified least squares procedures

Abraham Savitzky and Marcel JE Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical chemistry, 1964. 5

work page 1964
[39]

Narasimhan

Mark Sheinin, Dorian Chan, Matthew O’Toole, and Srini- vasa G. Narasimhan. Dual-shutter optical vibration sensing. InIEEE Conf. Comput. Vis. Pattern Recog., 2022. 1, 2

work page 2022
[40]

Smith, Pratham Desai, Vishal Agarwal, and Mo- hit Gupta

Brandon M. Smith, Pratham Desai, Vishal Agarwal, and Mo- hit Gupta. Colux: multi-object 3d micro-motion analysis us- ing speckle imaging.ACM Trans. Graph., 36(4), 2017

work page 2017
[41]

Smith, Matthew O’Toole, and Mohit Gupta

Brandon M. Smith, Matthew O’Toole, and Mohit Gupta. Tracking multiple objects outside the line of sight using speckle imaging. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

work page 2018
[42]

Steinmetz and Joshua D

Christian J. Steinmetz and Joshua D. Reiss. auraloss: Audio focused loss functions in PyTorch. InDigital Music Research Network One-day Workshop (DMRN+15), 2020. 11, 13

work page 2020
[43]

Sullivan.Practical Array Processing

Mark C. Sullivan.Practical Array Processing. McGraw Hill,

work page
[44]

Woinowsky-Krieger.Theory of Plates and Shells

Stephen Timoshenko and S. Woinowsky-Krieger.Theory of Plates and Shells. McGraw-Hill, 2nd edition, 1959. 3

work page 1959
[45]

SciPy 1.0: fundamental algorithms for scientific computing in python.Nature methods, 2020

Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in python.Nature methods, 2020. 5

work page 2020
[46]

Phase-based video motion processing

Neal Wadhwa, Michael Rubinstein, Fr ´edo Durand, and William T Freeman. Phase-based video motion processing. ACM Transactions on Graphics (TOG), 32(4):1–10, 2013. 1, 2

work page 2013
[47]

Riesz pyramids for fast phase-based video magnification

Neal Wadhwa, Michael Rubinstein, Fr ´edo Durand, and William T Freeman. Riesz pyramids for fast phase-based video magnification. InIEEE International Conference on Computational Photography, pages 1–10. IEEE, 2014

work page 2014
[48]

Eu- lerian video magnification and analysis.Communications of the ACM, 60(1):87–95, 2016

Neal Wadhwa, Hao-Yu Wu, Abe Davis, Michael Rubin- stein, Eugene Shih, Gautham J Mysore, Justin G Chen, Oral Buyukozturk, John V Guttag, William T Freeman, et al. Eu- lerian video magnification and analysis.Communications of the ACM, 60(1):87–95, 2016. 1, 2

work page 2016
[49]

Phase-coherent multi-sensor synthesis for enhanced photoa- coustic imaging: a comprehensive framework for optimal sensor integration.Biomed

Chaoneng Wu, Wei Li, Yizhi Liang, Peiqian He, Changze Song, Xue Bai, Linghao Cheng, Long Jin, and Bai-Ou Guan. Phase-coherent multi-sensor synthesis for enhanced photoa- coustic imaging: a comprehensive framework for optimal sensor integration.Biomed. Opt. Express, 16(5):1909–1924,

work page 1909
[50]

Eulerian video mag- nification for revealing subtle changes in the world.ACM transactions on graphics (TOG), 31(4):1–8, 2012

Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Fr´edo Durand, and William Freeman. Eulerian video mag- nification for revealing subtle changes in the world.ACM transactions on graphics (TOG), 31(4):1–8, 2012. 1, 2

work page 2012
[51]

Fast motion estimation of one-dimensional laser speckle image and its application on real-time audio signal acquisition

Nan Wu and Shinichiro Haruyama. Fast motion estimation of one-dimensional laser speckle image and its application on real-time audio signal acquisition. In2020 the 6th In- ternational Conference on Communication and Information Processing, pages 128–134, 2020. 2

work page 2020
[52]

The 20k samples-per- second real time detection of acoustic vibration based on dis- placement estimation of one-dimensional laser speckle im- ages.Sensors, 21(9):2938, 2021

Nan Wu and Shinichiro Haruyama. The 20k samples-per- second real time detection of acoustic vibration based on dis- placement estimation of one-dimensional laser speckle im- ages.Sensors, 21(9):2938, 2021. 2

work page 2021
[53]

Paral- lel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spec- trogram

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Paral- lel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spec- trogram. InIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020. 11, 13

work page 2020
[54]

Simultaneous remote extraction of multiple speech sources and heart beats from secondary speckles pattern.Op- tics express, 17(24):21566–21580, 2009

Zeev Zalevsky, Yevgeny Beiderman, Israel Margalit, Shimshon Gingold, Mina Teicher, Vicente Mico, and Javier Garcia. Simultaneous remote extraction of multiple speech sources and heart beats from secondary speckles pattern.Op- tics express, 17(24):21566–21580, 2009. 1, 2

work page 2009
[55]

Narasimhan

Tianyuan Zhang, Mark Sheinin, Dorian Chan, Mark Rau, Matthew O’Toole, and Srinivasa G. Narasimhan. Analyz- ing physical impacts using transient surface wave imaging. InIEEE Conf. Comput. Vis. Pattern Recog., 2023. 2, 4

work page 2023
[56]

Video acceleration magnification

Yichao Zhang, Silvia L Pintea, and Jan C Van Gemert. Video acceleration magnification. InProceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 529–537, 2017. 1

work page 2017
[57]

PhD thesis, Universit´e d’Ottawa/University of Ot- tawa, 2016

Meng Zhou.Vibration Extraction Using Rolling Shutter Cameras. PhD thesis, Universit´e d’Ottawa/University of Ot- tawa, 2016. 1

work page 2016
[58]

Event-based visual vibrometry

Xinyu Zhou, Peiqi Duan, Yeliduosi Xiaokaiti, Chao Xu, and Boxin Shi. Event-based visual vibrometry. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 24666–24676, 2025. 2 10 A. Spatially varying optical transfer and mode shape estimation In Sec. 3.2 of the main manuscript, we simplified the rela- tionship between the ...

work page 2025

[1] [1]

Imaging with local speckle intensity correlations: the- ory and practice.ACM Transactions on Graphics (TOG), 40 (3):1–22, 2021

Marina Alterman, Chen Bar, Ioannis Gkioulekas, and Anat Levin. Imaging with local speckle intensity correlations: the- ory and practice.ACM Transactions on Graphics (TOG), 40 (3):1–22, 2021. 2

work page 2021

[2] [2]

Springer Berlin, Heidelberg,

Jacob Benesty, Jingdong Chen, and Yiteng Huang.Micro- phone Array Signal Processing. Springer Berlin, Heidelberg,

work page

[3] [3]

John Wiley & Sons Singapore Pte

Jacob Benesty, Israel Cohen, and Jingdong Chen.Funda- mentals of Signal Enhancement and Array Signal Process- ing. John Wiley & Sons Singapore Pte. Ltd., 2017. 2, 6, 7, 11

work page 2017

[4] [4]

Long-range detection of acoustic vibrations by speckle tracking.Applied optics, 58 (28):7805–7809, 2019

S Bianchi and E Giacomozzi. Long-range detection of acoustic vibrations by speckle tracking.Applied optics, 58 (28):7805–7809, 2019. 2

work page 2019

[5] [5]

Sound speeds of solids from ultrasonic pulse receiver measurements

Jack Denman Borg and Dan Dolan. Sound speeds of solids from ultrasonic pulse receiver measurements. Technical re- port, Sandia National Laboratories, 2025. 3

work page 2025

[6] [6]

Estimating the material properties of fabric from video

Katherine L Bouman, Bei Xiao, Peter Battaglia, and William T Freeman. Estimating the material properties of fabric from video. InProceedings of the IEEE international conference on computer vision, pages 1984–1991, 2013. 1

work page 1984

[7] [7]

Ventura.Frequency-Domain Identification, chapter 10, pages 261–280

Rune Brincker and Carlos E. Ventura.Frequency-Domain Identification, chapter 10, pages 261–280. John Wiley & Sons, Ltd, 2015. 4, 5

work page 2015

[8] [8]

Modal identification from ambient responses using frequency do- main decomposition

Rune Brincker, Lingmi Zhang, and Palle Andersen. Modal identification from ambient responses using frequency do- main decomposition. InIMAC 18: Proceedings of the Inter- national Modal Analysis Conference (IMAC), 2000. 4, 5

work page 2000

[9] [9]

Smaller than the eye can see: Vibration analysis with video cameras

Oral Buyukozturk, Justin G Chen, Neal Wadhwa, Abe Davis, Fr´edo Durand, and William T Freeman. Smaller than the eye can see: Vibration analysis with video cameras. InWorld Conference on Non-Destructive Testing 2016, 2016. 1

work page 2016

[10] [10]

Yates, and Laura Waller

Mingxuan Cai, Dekel Galor, Amit Pal Singh Kohli, Jacob L. Yates, and Laura Waller. Event2audio: Event-based opti- cal vibration sensing. InIEEE International Conference on Computational Photography, 2025. 1, 2

work page 2025

[11] [11]

Chen, Neal Wadhwa, Young-Jin Cha, Fr ´edo Du- rand, William T

Justin G. Chen, Neal Wadhwa, Young-Jin Cha, Fr ´edo Du- rand, William T. Freeman, and Oral Buyukozturk. Modal identification of simple structures with high-speed video us- ing motion magnification.Journal of Sound and Vibration, 345:58–71, 2015. 1

work page 2015

[12] [12]

Video camera– based vibration measurement for civil infrastructure applica- tions.Journal of Infrastructure Systems, 23(3):B4016013, 2017

Justin G Chen, Abe Davis, Neal Wadhwa, Fr ´edo Durand, William T Freeman, and Oral B ¨uy¨uk¨ozt¨urk. Video camera– based vibration measurement for civil infrastructure applica- tions.Journal of Infrastructure Systems, 23(3):B4016013, 2017

work page 2017

[13] [13]

Event-based motion magnification

Yutian Chen, Shi Guo, Fangzheng Yu, Feng Zhang, Jinwei Gu, and Tianfan Xue. Event-based motion magnification. In European Conference on Computer Vision, pages 428–444. Springer, 2024. 1

work page 2024

[14] [14]

Speech Processing in Modern Communication

Israel Cohen, Jacob Benesty, and Sharon Gannot, editors. Speech Processing in Modern Communication. Springer Berlin, Heidelberg, 2010. 6

work page 2010

[15] [15]

Lothar Cremer, Manfred Heckl, and Bert A. T. Petersson. Structure-Borne Sound: Structural Vibrations and Sound Radiation at Audio Frequencies. Springer-Verlag Berlin Hei- delberg, 3rd edition, 2005. 3

work page 2005

[16] [16]

The visual microphone: Passive recovery of sound from video.ACM Trans

Abe Davis, Michael Rubinstein, Neal Wadhwa, Gautham J Mysore, Fredo Durand, and William T Freeman. The visual microphone: Passive recovery of sound from video.ACM Trans. Graph., 2014. 1, 2

work page 2014

[17] [17]

Image-space modal bases for plausible manipulation of objects in video

Abe Davis, Justin G Chen, and Fr ´edo Durand. Image-space modal bases for plausible manipulation of objects in video. ACM Transactions on Graphics (TOG), 34(6):1–7, 2015. 1

work page 2015

[18] [18]

Video magnification in presence of large motions

Mohamed Elgharib, Mohamed Hefeeda, Fredo Durand, and William T Freeman. Video magnification in presence of large motions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4119– 4127, 2015. 1

work page 2015

[19] [19]

Ewins.Modal Testing: Theory, Practice and Appli- cation

David J. Ewins.Modal Testing: Theory, Practice and Appli- cation. Research Studies Press, 2nd edition, 2000. 3, 5

work page 2000

[20] [20]

Prentice Hall, 1984

Simon Haykin, editor.Array Signal Processing. Prentice Hall, 1984. 6

work page 1984

[21] [21]

Modal analysis methods – fre- quency domain

Jimin He and Zhi-Fang Fu. Modal analysis methods – fre- quency domain. InModal Analysis, chapter 8, pages 159–

work page

[22] [22]

Butterworth-Heinemann, Oxford, 2001. 4

work page 2001

[23] [23]

Speech intelligibility pre- diction using a neurogram similarity index measure.Speech Communication, 54(2):306–320, 2012

Andrew Hines and Naomi Harte. Speech intelligibility pre- diction using a neurogram similarity index measure.Speech Communication, 54(2):306–320, 2012. 11, 12

work page 2012

[24] [24]

ViSQOLAudio: An objec- tive audio quality metric for low bitrate codecs.The Journal of the Acoustical Society of America, 137(6):EL449–EL455,

Andrew Hines, Eoin Gillen, Damien Kelly, Jan Skoglund, Anil Kokaram, and Naomi Harte. ViSQOLAudio: An objec- tive audio quality metric for low bitrate codecs.The Journal of the Acoustical Society of America, 137(6):EL449–EL455,

work page

[25] [25]

DE-R 351 Diffractive Optical Element.https://holoeye.com/product/de-r- 351/, 2023

HOLOEYE Photonics AG. DE-R 351 Diffractive Optical Element.https://holoeye.com/product/de-r- 351/, 2023. Accessed: 2025-11-11. 6

work page 2023

[26] [26]

Event-based vi- sual microphone

Matthew Howard and Keigo Hirakawa. Event-based vi- sual microphone. InICASSP 2023 - 2023 IEEE Interna- tional Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pages 1–5, 2023. 2

work page 2023

[27] [27]

Kensei Jo, Mohit Gupta, and Shree K. Nayar. Spedo: 6 dof ego-motion sensor using speckle defocus imaging. InPro- ceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 2

work page 2015

[28] [28]

Can one hear the shape of a drum?The american mathematical monthly, 1966

Mark Kac. Can one hear the shape of a drum?The american mathematical monthly, 1966. 1

work page 1966

[29] [29]

Learning to see inside opaque liquid containers using speckle vibrometry

Matan Kichler, Shai Bagon, and Mark Sheinin. Learning to see inside opaque liquid containers using speckle vibrometry. InInt. Conf. Comput. Vis., 2025. 2, 6

work page 2025

[30] [30]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InICLR, 2015. 6

work page 2015

[31] [31]

Motion magnification.ACM transactions on graphics (TOG), 24(3):519–526, 2005

Ce Liu, Antonio Torralba, William T Freeman, Fr ´edo Du- rand, and Edward H Adelson. Motion magnification.ACM transactions on graphics (TOG), 24(3):519–526, 2005. 1, 2

work page 2005

[32] [32]

Meirovitch.Fundamentals of Vibrations

L. Meirovitch.Fundamentals of Vibrations. McGraw-Hill,

work page

[33] [33]

Lamphone: Real-time passive sound recovery from light bulb vibrations.Cryptology ePrint Archive, 2020

Ben Nassi, Yaron Pirutin, Adi Shamir, Yuval Elovici, and Boris Zadov. Lamphone: Real-time passive sound recovery from light bulb vibrations.Cryptology ePrint Archive, 2020. 1 9

work page 2020

[34] [34]

Live demonstration: Event-based visual micro- phone

Ryogo Niwa, Tatsuki Fushimi, Kenta Yamamoto, and Yoichi Ochiai. Live demonstration: Event-based visual micro- phone. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Work- shops, pages 4054–4055, 2023. 2

work page 2023

[35] [35]

Learning-based video motion magnification

Tae-Hyun Oh, Ronnachai Jaroensri, Changil Kim, Mohamed Elgharib, Fr’edo Durand, William T Freeman, and Wojciech Matusik. Learning-based video motion magnification. In Proceedings of the European Conference on Computer Vi- sion (ECCV), pages 633–648, 2018. 1, 2

work page 2018

[36] [36]

Rao.Vibration of Continuous Systems

Singiresu S. Rao.Vibration of Continuous Systems. John Wiley & Sons, 2007. 3

work page 2007

[37] [37]

Richards.Fundamentals of Radar Signal Process- ing

Mark A. Richards.Fundamentals of Radar Signal Process- ing. McGraw-Hill Education, 2nd edition, 2014. 6

work page 2014

[38] [38]

Smoothing and differentiation of data by simplified least squares procedures

Abraham Savitzky and Marcel JE Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical chemistry, 1964. 5

work page 1964

[39] [39]

Narasimhan

Mark Sheinin, Dorian Chan, Matthew O’Toole, and Srini- vasa G. Narasimhan. Dual-shutter optical vibration sensing. InIEEE Conf. Comput. Vis. Pattern Recog., 2022. 1, 2

work page 2022

[40] [40]

Smith, Pratham Desai, Vishal Agarwal, and Mo- hit Gupta

Brandon M. Smith, Pratham Desai, Vishal Agarwal, and Mo- hit Gupta. Colux: multi-object 3d micro-motion analysis us- ing speckle imaging.ACM Trans. Graph., 36(4), 2017

work page 2017

[41] [41]

Smith, Matthew O’Toole, and Mohit Gupta

Brandon M. Smith, Matthew O’Toole, and Mohit Gupta. Tracking multiple objects outside the line of sight using speckle imaging. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

work page 2018

[42] [42]

Steinmetz and Joshua D

Christian J. Steinmetz and Joshua D. Reiss. auraloss: Audio focused loss functions in PyTorch. InDigital Music Research Network One-day Workshop (DMRN+15), 2020. 11, 13

work page 2020

[43] [43]

Sullivan.Practical Array Processing

Mark C. Sullivan.Practical Array Processing. McGraw Hill,

work page

[44] [44]

Woinowsky-Krieger.Theory of Plates and Shells

Stephen Timoshenko and S. Woinowsky-Krieger.Theory of Plates and Shells. McGraw-Hill, 2nd edition, 1959. 3

work page 1959

[45] [45]

SciPy 1.0: fundamental algorithms for scientific computing in python.Nature methods, 2020

Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in python.Nature methods, 2020. 5

work page 2020

[46] [46]

Phase-based video motion processing

Neal Wadhwa, Michael Rubinstein, Fr ´edo Durand, and William T Freeman. Phase-based video motion processing. ACM Transactions on Graphics (TOG), 32(4):1–10, 2013. 1, 2

work page 2013

[47] [47]

Riesz pyramids for fast phase-based video magnification

Neal Wadhwa, Michael Rubinstein, Fr ´edo Durand, and William T Freeman. Riesz pyramids for fast phase-based video magnification. InIEEE International Conference on Computational Photography, pages 1–10. IEEE, 2014

work page 2014

[48] [48]

Eu- lerian video magnification and analysis.Communications of the ACM, 60(1):87–95, 2016

Neal Wadhwa, Hao-Yu Wu, Abe Davis, Michael Rubin- stein, Eugene Shih, Gautham J Mysore, Justin G Chen, Oral Buyukozturk, John V Guttag, William T Freeman, et al. Eu- lerian video magnification and analysis.Communications of the ACM, 60(1):87–95, 2016. 1, 2

work page 2016

[49] [49]

Phase-coherent multi-sensor synthesis for enhanced photoa- coustic imaging: a comprehensive framework for optimal sensor integration.Biomed

Chaoneng Wu, Wei Li, Yizhi Liang, Peiqian He, Changze Song, Xue Bai, Linghao Cheng, Long Jin, and Bai-Ou Guan. Phase-coherent multi-sensor synthesis for enhanced photoa- coustic imaging: a comprehensive framework for optimal sensor integration.Biomed. Opt. Express, 16(5):1909–1924,

work page 1909

[50] [50]

Eulerian video mag- nification for revealing subtle changes in the world.ACM transactions on graphics (TOG), 31(4):1–8, 2012

Hao-Yu Wu, Michael Rubinstein, Eugene Shih, John Guttag, Fr´edo Durand, and William Freeman. Eulerian video mag- nification for revealing subtle changes in the world.ACM transactions on graphics (TOG), 31(4):1–8, 2012. 1, 2

work page 2012

[51] [51]

Fast motion estimation of one-dimensional laser speckle image and its application on real-time audio signal acquisition

Nan Wu and Shinichiro Haruyama. Fast motion estimation of one-dimensional laser speckle image and its application on real-time audio signal acquisition. In2020 the 6th In- ternational Conference on Communication and Information Processing, pages 128–134, 2020. 2

work page 2020

[52] [52]

The 20k samples-per- second real time detection of acoustic vibration based on dis- placement estimation of one-dimensional laser speckle im- ages.Sensors, 21(9):2938, 2021

Nan Wu and Shinichiro Haruyama. The 20k samples-per- second real time detection of acoustic vibration based on dis- placement estimation of one-dimensional laser speckle im- ages.Sensors, 21(9):2938, 2021. 2

work page 2021

[53] [53]

Paral- lel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spec- trogram

Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Paral- lel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spec- trogram. InIEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020. 11, 13

work page 2020

[54] [54]

Simultaneous remote extraction of multiple speech sources and heart beats from secondary speckles pattern.Op- tics express, 17(24):21566–21580, 2009

Zeev Zalevsky, Yevgeny Beiderman, Israel Margalit, Shimshon Gingold, Mina Teicher, Vicente Mico, and Javier Garcia. Simultaneous remote extraction of multiple speech sources and heart beats from secondary speckles pattern.Op- tics express, 17(24):21566–21580, 2009. 1, 2

work page 2009

[55] [55]

Narasimhan

Tianyuan Zhang, Mark Sheinin, Dorian Chan, Mark Rau, Matthew O’Toole, and Srinivasa G. Narasimhan. Analyz- ing physical impacts using transient surface wave imaging. InIEEE Conf. Comput. Vis. Pattern Recog., 2023. 2, 4

work page 2023

[56] [56]

Video acceleration magnification

Yichao Zhang, Silvia L Pintea, and Jan C Van Gemert. Video acceleration magnification. InProceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 529–537, 2017. 1

work page 2017

[57] [57]

PhD thesis, Universit´e d’Ottawa/University of Ot- tawa, 2016

Meng Zhou.Vibration Extraction Using Rolling Shutter Cameras. PhD thesis, Universit´e d’Ottawa/University of Ot- tawa, 2016. 1

work page 2016

[58] [58]

Event-based visual vibrometry

Xinyu Zhou, Peiqi Duan, Yeliduosi Xiaokaiti, Chao Xu, and Boxin Shi. Event-based visual vibrometry. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 24666–24676, 2025. 2 10 A. Spatially varying optical transfer and mode shape estimation In Sec. 3.2 of the main manuscript, we simplified the rela- tionship between the ...

work page 2025