EgoSound: Benchmarking Sound Understanding in Egocentric Videos

2026 · cs.CV · arXiv 2602.14122

1 Pith paper cites this work.

abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7,315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world. Project page: https://groolegend.github.io/EgoSound/

representative citing papers

IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction

cs.CV · 2026-05-03 · unverdicted · novelty 5.0

IMPACT-HOI introduces a supervisory control framework for constructing partial HOI event graphs in procedural videos via trust-calibrated automation and atomic rollback to reduce manual annotation effort while preserving human decisions.
