Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
citing papers explorer
-
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation
VISAFF is a tuning-free speaker-centered visual affective feature learning framework for emotion recognition in conversation that guides frozen VLMs to active speakers and uses reliability-guided complementation from textual and acoustic modalities to achieve competitive performance.
-
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.