SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Jiebo Luo; Jinfa Huang; Qingchuan Ma; Rongfang Luo; Rongrong Ji; Ruize Fang; Tianyu Xie; Wang Chen; Xiawu Zheng; Yan Yang

arxiv: 2603.16859 · v2 · pith:GRI5P6UNnew · submitted 2026-03-17 · 💻 cs.AI

Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Yixuan Zou, Qingchuan Ma, Zhiqiang Lu, Ruize Fang, Xiawu Zheng, Jiebo Luo, Rongrong Ji

classification 💻 cs.AI

keywords: socialomni, interactivity, interruption, models, olms, social, across, audio-visual

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search
cs.SD 2026-05 unverdicted novelty 8.0

Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
cs.CV 2026-05 unverdicted novelty 7.0

GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.

CogniRoute: Learning to Route Social Evidence in Omni-Modal Models
cs.CV 2026-06 unverdicted novelty 5.0

CogniRoute adds a cognitive schema and route-aware RL to an omni-modal MoE, reaching 59.38% accuracy on a new 118K-example social video QA benchmark and beating prior baselines by 15-27 points.