SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing. arXiv preprint arXiv:2506.05414

Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman · 2025 · arXiv 2506.05414

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

Do Joint Audio-Video Generation Models Understand Physics?

cs.SD · 2026-05-08 · unverdicted · novelty 7.0

AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

MLLMs achieve only 42% accuracy on a new audio-visual task requiring second-order spatial ToM under perceptual limits, while a proposed sensory-bounded CoT outperforms egocentric and allocentric baselines.

