Savvy: Spatial awareness via audio-visual LLMs through seeing and hearing. arXiv preprint arXiv:2506.05414
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- EgoSound: Benchmarking Sound Understanding in Egocentric Videos
EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
- Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
MLLMs achieve only 42% accuracy on a new audio-visual task requiring second-order spatial ToM under perceptual limits, while a proposed sensory-bounded CoT outperforms egocentric and allocentric baselines.