arxiv: 2605.22109 · v1 · pith:33PA3K6Snew · submitted 2026-05-21 · 💻 cs.AI · cs.CV · cs.CY

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Pith reviewed 2026-05-22 05:40 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.CY
keywords multimodal large language models · personality perception · grounded reasoning · Big Five traits · video understanding · prejudice gap · social cognition

The pith

MLLMs often produce correct personality ratings without grounding them in observable video cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Grounded Personality Reasoning as a task that requires multimodal large language models to link Big Five personality ratings to specific evidence extracted from videos rather than stopping at numerical scores. It supplies a dataset of 1,104 videos with timestamped behavioral observations and evidence-grounded analyses, then evaluates 27 models across rating accuracy, reasoning quality, and grounding. The central result shows that 51 percent of correct ratings lack grounding in retrieved cues while holistic grounding rates range only from 0 to 33.5 percent. A sympathetic reader would care because many planned uses of these models involve social judgment, where ungrounded accuracy risks embedding prejudice instead of delivering perception.

Core claim

The authors establish that across the field of tested MLLMs, 51 percent of correct Big Five ratings are not grounded in retrieved cues, and holistic-grounding rates span only 0 to 33.5 percent. This disconnect is measured by requiring models to produce a rating, a reasoning step, and explicit grounding in timestamped observations on a dataset built through a multi-agent pipeline followed by human verification.
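
As a rough illustration of how such a share could be tallied, the sketch below counts correct ratings that fail a grounding check. The `SampleOutcome` record, its boolean fields, and the pass/fail framing are assumptions made for illustration; the paper's exact definitions of Prejudice Rate and Holistic-Grounding Rate are not given on this page, so the two functions are plausible readings rather than the authors' formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleOutcome:
    """Hypothetical per-sample record; field names are illustrative assumptions."""
    rating_correct: bool  # T1: rating matches the reference label
    reasoning_ok: bool    # T2: open-ended reasoning judged adequate
    grounding_ok: bool    # T3: cue-grounding check passed


def ungrounded_share(outcomes):
    """Share of correct ratings that fail grounding (assumed reading of the 51% figure)."""
    correct = [o for o in outcomes if o.rating_correct]
    if not correct:
        return 0.0
    return sum(not o.grounding_ok for o in correct) / len(correct)


def all_tiers_share(outcomes):
    """Share of samples passing all three tiers (assumed stand-in for holistic grounding)."""
    if not outcomes:
        return 0.0
    passed = sum(
        o.rating_correct and o.reasoning_ok and o.grounding_ok for o in outcomes
    )
    return passed / len(outcomes)


# Toy data, not taken from the paper.
toy = [
    SampleOutcome(True, True, True),
    SampleOutcome(True, True, False),
    SampleOutcome(True, False, False),
    SampleOutcome(False, False, True),
]
print(ungrounded_share(toy))  # 2 of the 3 correct ratings are ungrounded
print(all_tiers_share(toy))   # 1 of the 4 samples passes every tier
```

Conditioning the first share on correct ratings, rather than on all samples, is what separates "right score, wrong reason" from plain inaccuracy.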

What carries the argument

The three-tier evaluation (rating, reasoning, grounding), together with four sample-level failure-mode metrics: Prejudice Rate, Confabulation Rate, Integration-failure Rate, and Holistic-Grounding Rate.

If this is right

  • Standard numerical accuracy benchmarks for personality perception will overstate model capability if they ignore grounding.
  • Model development must add mechanisms that retrieve and cite specific behavioral evidence rather than rely on first-impression patterns.
  • Deployment in human-facing roles such as interviews or team evaluation requires grounding checks to reduce the risk of unexamined prejudice.
  • New training objectives could reward explicit cue integration to close the observed gap between score and reason.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ungrounded pattern may appear in other multimodal tasks that involve interpreting human behavior.
  • Training data or fine-tuning that explicitly penalizes correct-but-ungrounded answers could be tested as a direct remedy.
  • Extending the approach to longer videos or live interactions would reveal whether the prejudice gap widens with more complex input.

Load-bearing premise

The multi-agent pipeline combined with human verification produces reliable timestamped behavioral observations that constitute valid ground truth for personality trait analyses.

What would settle it

Independent human re-annotation of a random subset of the videos followed by recomputation of the prejudice and holistic-grounding rates to check whether the reported percentages hold.
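
A minimal sketch of that audit procedure is shown below, assuming integer video IDs, an audit size of 100, and a tolerance of 0.05. All three values, and both helper names, are hypothetical; only the dataset size of 1,104 comes from the paper.

```python
import random


def draw_audit_sample(video_ids, k, seed=0):
    """Seeded simple random sample without replacement, so the audit is reproducible."""
    ids = list(video_ids)
    if k > len(ids):
        raise ValueError("audit size exceeds the number of videos")
    return sorted(random.Random(seed).sample(ids, k))


def rate_within_tolerance(reported, recomputed, tolerance):
    """True when a rate recomputed on re-annotated data stays close to the reported one."""
    return abs(reported - recomputed) <= tolerance


video_ids = range(1104)  # dataset size from the paper; integer IDs are placeholders
audit = draw_audit_sample(video_ids, k=100)
print(len(audit), audit[:5])

# Example comparison with made-up recomputed value 0.48 against the reported 0.51.
print(rate_within_tolerance(0.51, 0.48, tolerance=0.05))
```

Fixing the seed lets a second team draw the identical subset before re-annotating it.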

Figures

Figures reproduced from arXiv: 2605.22109 by Bo Zheng, Caixin Kang, Huchuan Lu, Kaipeng Zhang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Sitong Gong, Tianyu Yan, Yifei Huang, Yoichi Sato.

Figure 1: Overview of MM-OCEAN. Multimodal inputs are processed by a multi-agent human-collaborative pipeline, filtered by text-only LLMs, and reviewed by experts to produce a benchmark supporting three tasks: ordinal Big Five rating (T1), open-ended evidence-grounded reasoning (T2), and structured cue-grounding Multiple-Choice Questions (MCQs) (T3).
Figure 2: MM-OCEAN overview. (a) Three-layer sunburst over benchmark scope, three evaluation tasks, and the seven cue-grounding categories. (b) Atomic-observation density across the four perceptual channels; bounding-box geometry is attached to every Expression / Action observation.
Adjacent text captured with the figure: Outputs across the three tasks. A model $f_\theta$ must produce: T1 (Rating) $\hat{y}_i \in L,\ \forall i \in T$ (1); T2 (Reasoning) $(\hat{O}, \hat{R}) = f_\theta(V)$, $\hat{O} = \{o_k\}$…
Figure 3: The five-stage multi-agent human-collaborative annotation pipeline. Observer drafts atomic observations → Annotator verifies and localizes them (Stage 1) → Psychologist produces evidence-grounded Big Five analyses (Stage 2) → Examiner generates seven categories of cue-grounding MCQs (Stage 3) → Aligner enforces four consistency checks C1–C4 (Stage 4) → Stage 5 applies text-leakage filtering (a) and expert …
Figure 4: Per-category cognitive radar (T3). Top-3 closed vs. Top-3 open accuracy across the seven cue-grounding MCQ categories. The closed-source advantage concentrates on the visual-grounding cluster (Spatial Localization, Micro-expression, Temporal-Spatial Joint).
[Adjacent scatter plot: rank on T1 (Rating) against average rank on T2 + T3 (Reasoning + Grounding), with points labeled by model, including Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.5, …]

Original abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Grounded Personality Reasoning (GPR) as a new task requiring MLLMs to anchor Big Five personality ratings in observable video evidence via rating-reasoning-grounding chains. It releases the MM-OCEAN dataset (1,104 videos, 5,320 MCQs) constructed via a multi-agent pipeline plus human verification, containing timestamped observations and cue-grounding questions. The authors benchmark 27 MLLMs (13 closed, 14 open) using three-tier evaluation and four failure-mode metrics (Prejudice Rate, Confabulation Rate, Integration-failure Rate, Holistic-grounding Rate), reporting that 51% of correct ratings lack grounding in retrieved cues and that Holistic-Grounding Rates range from 0 to 33.5%.

Significance. If the central findings hold after validation, the work is significant for shifting personality-perception evaluation from score accuracy alone to process grounding, exposing a systematic disconnect between correct outputs and evidence-based reasoning in current MLLMs. The new task definition, dataset, and metrics provide concrete tools for measuring and improving social cognition capabilities relevant to human-facing applications.

major comments (1)
  1. [Dataset and evaluation setup] Dataset construction (multi-agent pipeline + human verification): The Prejudice Gap claim (51% of correct ratings ungrounded) and Holistic-Grounding Rate range (0-33.5%) are load-bearing on the fidelity of the ground-truth cue labels and MCQ distinctions. No inter-annotator agreement, pipeline error rates, or external validation metrics for cue extraction or grounding criteria are reported, leaving open whether the observed gap reflects model behavior or annotation artifacts.
minor comments (2)
  1. [Task formalization] Clarify the exact rules used to map timestamped observations to the seven categories of cue-grounding MCQs and how these categories map to the four failure-mode metrics.
  2. [Benchmark results] Provide statistical significance tests or confidence intervals for the 51% figure and the per-model Holistic-Grounding Rates to support cross-model comparisons.
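
One way the second minor comment could be addressed is a percentile bootstrap over per-sample flags, sketched below. The function name, the resample count, and the toy data (51 ungrounded out of 100 correct ratings) are illustrative assumptions; the paper does not state its sample counts on this page or which interval method, if any, the authors would use.

```python
import random


def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 flags."""
    if not flags:
        raise ValueError("flags must be non-empty")
    rng = random.Random(seed)
    n = len(flags)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(flags, k=n)  # sample with replacement
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data: 1 marks a correct-but-ungrounded rating (not the paper's data).
flags = [1] * 51 + [0] * 49
low, high = bootstrap_ci(flags)
print(f"point estimate {sum(flags) / len(flags):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Running the same routine per model would give intervals for each Holistic-Grounding Rate, which makes it visible when two models' rates are not clearly separated.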

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment below, committing to strengthen the manuscript's reporting on dataset validation.

Point-by-point responses
  1. Referee: [Dataset and evaluation setup] Dataset construction (multi-agent pipeline + human verification): The Prejudice Gap claim (51% of correct ratings ungrounded) and Holistic-Grounding Rate range (0-33.5%) are load-bearing on the fidelity of the ground-truth cue labels and MCQ distinctions. No inter-annotator agreement, pipeline error rates, or external validation metrics for cue extraction or grounding criteria are reported, leaving open whether the observed gap reflects model behavior or annotation artifacts.

    Authors: We agree that quantitative validation of the ground-truth labels is essential for interpreting the Prejudice Gap and Holistic-Grounding Rates. The MM-OCEAN construction used a multi-agent pipeline for initial extraction followed by human verification, but the initial submission did not report inter-annotator agreement or pipeline error rates. In the revised manuscript we will add a dedicated subsection reporting Cohen's kappa for cue relevance and grounding judgments during human verification, plus error rates observed in the pipeline. This will allow readers to judge whether the reported gaps reflect model behavior rather than annotation artifacts. revision: yes
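
For readers unfamiliar with the agreement statistic named in the rebuttal, the sketch below computes Cohen's kappa for two annotators from the standard formula (observed agreement corrected for chance agreement). The label strings and the toy annotations are hypothetical and do not come from MM-OCEAN.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy grounding judgments from two hypothetical annotators.
annotator_a = ["grounded", "grounded", "ungrounded", "ungrounded", "grounded", "ungrounded"]
annotator_b = ["grounded", "ungrounded", "ungrounded", "ungrounded", "grounded", "ungrounded"]
print(round(cohens_kappa(annotator_a, annotator_b), 3))  # → 0.667
```

Here the annotators agree on 5 of 6 items while chance agreement is 0.5, so kappa is (5/6 - 1/2) / (1 - 1/2) = 2/3.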

Circularity Check

0 steps flagged

New task, dataset, and empirical rates show no circular reduction

full rationale

The paper defines a new task (Grounded Personality Reasoning), constructs MM-OCEAN via multi-agent pipeline plus human verification, and reports direct empirical statistics (51% ungrounded correct ratings, HR 0-33.5%) from three-tier evaluation of 27 MLLMs. These quantities are computed from the newly annotated data rather than obtained by fitting parameters to model outputs and relabeling them as predictions, or by any self-citation chain that supplies the central result. No equations, uniqueness theorems, or ansatzes are invoked that would make the reported gap equivalent to the input annotations by construction. The analysis is therefore self-contained as a standard benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the Big Five model plus the multi-agent pipeline with human verification yields trustworthy ground-truth behavioral observations; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The Big Five personality traits can be reliably inferred from observable behavioral cues in short videos.
    The entire GPR task and dataset construction rest on this premise.

pith-pipeline@v0.9.0 · 5850 in / 1367 out tokens · 38147 ms · 2026-05-22T05:40:07.386783+00:00 · methodology
