Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Bo Zheng; Caixin Kang; Huchuan Lu; Kaipeng Zhang; Liangyang Ouyang; Mingfang Zhang; Ruicong Liu; Sitong Gong; Tianyu Yan; Yifei Huang; Yoichi Sato

arXiv: 2605.22109 · v1 · pith:33PA3K6Snew · submitted 2026-05-21 · cs.AI · cs.CV · cs.CY



Pith reviewed 2026-05-22 05:40 UTC · model grok-4.3

classification cs.AI · cs.CV · cs.CY

keywords multimodal large language models · personality perception · grounded reasoning · Big Five traits · video understanding · prejudice gap · social cognition


The pith

MLLMs often produce correct personality ratings without grounding them in observable video cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Grounded Personality Reasoning as a task that requires multimodal large language models to link Big Five personality ratings to specific evidence extracted from videos rather than stopping at numerical scores. It supplies a dataset of 1,104 videos with timestamped behavioral observations and evidence-grounded analyses, then evaluates 27 models across rating accuracy, reasoning quality, and grounding. The central result shows that 51 percent of correct ratings lack grounding in retrieved cues while holistic grounding rates range only from 0 to 33.5 percent. A sympathetic reader would care because many planned uses of these models involve social judgment, where ungrounded accuracy risks embedding prejudice instead of delivering perception.

Core claim

The authors establish that across the field of tested MLLMs, 51 percent of correct Big Five ratings are not grounded in retrieved cues, and holistic-grounding rates span only 0 to 33.5 percent. This disconnect is measured by requiring models to produce a rating, a reasoning step, and explicit grounding in timestamped observations on a dataset built through a multi-agent pipeline followed by human verification.

What carries the argument

The argument rests on the three-tier evaluation of rating, reasoning, and grounding, together with four sample-level failure-mode metrics: Prejudice Rate, Confabulation Rate, Integration-failure Rate, and Holistic-grounding Rate.
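As a rough illustration of the headline statistic, the sketch below computes the share of correct ratings that lack grounding, which is how the abstract describes the 51% figure. The `Sample` fields and the function name are hypothetical, and the paper's exact per-sample criteria for "correct" and "grounded" are not given on this page, so treat this as one plausible reading rather than the authors' definition.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical per-sample flags; the paper's exact criteria are not stated here.
    rating_correct: bool  # assumed: the T1 rating matched the reference label
    cues_grounded: bool   # assumed: the model retrieved the supporting cues (T3)


def prejudice_rate(samples):
    """Share of correct ratings that are not backed by grounded cues."""
    correct = [s for s in samples if s.rating_correct]
    if not correct:
        return 0.0  # no correct ratings, so nothing to condition on
    ungrounded = sum(1 for s in correct if not s.cues_grounded)
    return ungrounded / len(correct)


# Toy data: three correct ratings, two of which are ungrounded.
demo = [
    Sample(rating_correct=True, cues_grounded=False),
    Sample(rating_correct=True, cues_grounded=True),
    Sample(rating_correct=False, cues_grounded=False),
    Sample(rating_correct=True, cues_grounded=False),
]
print(round(prejudice_rate(demo), 3))
```

Conditioning on correct ratings, rather than dividing by all samples, is what lets the statistic separate "right score" from "right reason".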

If this is right

Standard numerical accuracy benchmarks for personality perception will overstate model capability if they ignore grounding.
Model development must add mechanisms that retrieve and cite specific behavioral evidence rather than rely on first-impression patterns.
Deployment in human-facing roles such as interviews or team evaluation requires grounding checks to reduce the risk of unexamined prejudice.
New training objectives could reward explicit cue integration to close the observed gap between score and reason.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ungrounded pattern may appear in other multimodal tasks that involve interpreting human behavior.
Training data or fine-tuning that explicitly penalizes correct-but-ungrounded answers could be tested as a direct remedy.
Extending the approach to longer videos or live interactions would reveal whether the prejudice gap widens with more complex input.

Load-bearing premise

The multi-agent pipeline combined with human verification produces reliable timestamped behavioral observations that constitute valid ground truth for personality trait analyses.

What would settle it

Independent human re-annotation of a random subset of the videos followed by recomputation of the prejudice and holistic-grounding rates to check whether the reported percentages hold.

Figures

Figures reproduced from arXiv: 2605.22109 by Bo Zheng, Caixin Kang, Huchuan Lu, Kaipeng Zhang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, Sitong Gong, Tianyu Yan, Yifei Huang, Yoichi Sato.

**Figure 1.** Overview of MM-OCEAN. Multimodal inputs are processed by a multi-agent human-collaborative pipeline, filtered by text-only LLMs, and reviewed by experts to produce a benchmark supporting three tasks: ordinal Big Five rating (T1), open-ended evidence-grounded reasoning (T2), and structured cue-grounding Multiple-Choice Questions (MCQs) (T3).

**Figure 2.** MM-OCEAN overview. (a) Three-layer sunburst over benchmark scope, three evaluation tasks, and the seven cue-grounding categories. (b) Atomic-observation density across the four perceptual channels; bounding-box geometry is attached to every Expression / Action observation.

Outputs across the three tasks. A model $f_\theta$ must produce: T1 (Rating) $\hat{y}_i \in L,\ \forall i \in T$ (1); T2 (Reasoning) $(\hat{O}, \hat{R}) = f_\theta(V)$, $\hat{O} = \{o_k\}$…

**Figure 3.** The five-stage multi-agent human-collaborative annotation pipeline. Observer drafts atomic observations → Annotator verifies and localizes them (Stage 1) → Psychologist produces evidence-grounded Big Five analyses (Stage 2) → Examiner generates seven categories of cue-grounding MCQs (Stage 3) → Aligner enforces four consistency checks C1–C4 (Stage 4) → Stage 5 applies text-leakage filtering (a) and expert …

**Figure 4.** Per-category cognitive radar (T3). Top-3 closed vs. Top-3 open accuracy across the seven cue-grounding MCQ categories. The closed-source advantage concentrates on the visual-grounding cluster (Spatial Localization, Micro-expression, Temporal-Spatial Joint).

[Plot: rank on T1 (Rating) vs. average rank on T2 + T3 (Reasoning + Grounding); labeled models include Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.5, Gemini 2.5 Pr…]

Original abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MLLMs get many personality ratings right but ground only a minority in actual video cues, with the paper's 51% prejudice figure and low holistic rates as the core finding.


The paper's main point is that MLLMs often produce correct Big Five personality scores from video without anchoring those scores in the behavioral cues actually present. Across 27 models, 51% of correct ratings are ungrounded, and holistic-grounding rates range from 0 to 33.5%. That disconnect between score and evidence is the result worth paying attention to for anyone using these models in hiring, education, or social tools.

Referee Report

1 major / 2 minor

Summary. The paper introduces Grounded Personality Reasoning (GPR) as a new task requiring MLLMs to anchor Big Five personality ratings in observable video evidence via rating-reasoning-grounding chains. It releases the MM-OCEAN dataset (1,104 videos, 5,320 MCQs) constructed via a multi-agent pipeline plus human verification, containing timestamped observations and cue-grounding questions. The authors benchmark 27 MLLMs (13 closed, 14 open) using three-tier evaluation and four failure-mode metrics (Prejudice Rate, Confabulation Rate, Integration-failure Rate, Holistic-grounding Rate), reporting that 51% of correct ratings lack grounding in retrieved cues and that Holistic-Grounding Rates range from 0-33.5%.

Significance. If the central findings hold after validation, the work is significant for shifting personality-perception evaluation from score accuracy alone to process grounding, exposing a systematic disconnect between correct outputs and evidence-based reasoning in current MLLMs. The new task definition, dataset, and metrics provide concrete tools for measuring and improving social cognition capabilities relevant to human-facing applications.

major comments (1)

[Dataset and evaluation setup] Dataset construction (multi-agent pipeline + human verification): The Prejudice Gap claim (51% of correct ratings ungrounded) and Holistic-Grounding Rate range (0-33.5%) are load-bearing on the fidelity of the ground-truth cue labels and MCQ distinctions. No inter-annotator agreement, pipeline error rates, or external validation metrics for cue extraction or grounding criteria are reported, leaving open whether the observed gap reflects model behavior or annotation artifacts.

minor comments (2)

[Task formalization] Clarify the exact rules used to map timestamped observations to the seven categories of cue-grounding MCQs and how these categories map to the four failure-mode metrics.
[Benchmark results] Provide statistical significance tests or confidence intervals for the 51% figure and the per-model Holistic-Grounding Rates to support cross-model comparisons.
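One way to attach an interval to a reported proportion such as the 51% figure is the Wilson score interval, sketched below. The counts used here are hypothetical, since this page does not state how many correct-rating samples underlie the figure, and the authors may choose a different method (for example, a bootstrap over videos).

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 51 ungrounded out of 100 correct ratings.
low, high = wilson_interval(51, 100)
print(f"{low:.3f} to {high:.3f}")
```

The Wilson interval is used here instead of the simpler normal approximation because it stays well behaved near 0, which matters for models whose Holistic-Grounding Rate is reported at or close to 0%.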

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the single major comment below, committing to strengthen the manuscript's reporting on dataset validation.


Referee: [Dataset and evaluation setup] Dataset construction (multi-agent pipeline + human verification): The Prejudice Gap claim (51% of correct ratings ungrounded) and Holistic-Grounding Rate range (0-33.5%) are load-bearing on the fidelity of the ground-truth cue labels and MCQ distinctions. No inter-annotator agreement, pipeline error rates, or external validation metrics for cue extraction or grounding criteria are reported, leaving open whether the observed gap reflects model behavior or annotation artifacts.

Authors: We agree that quantitative validation of the ground-truth labels is essential for interpreting the Prejudice Gap and Holistic-Grounding Rates. The MM-OCEAN construction used a multi-agent pipeline for initial extraction followed by human verification, but the initial submission did not report inter-annotator agreement or pipeline error rates. In the revised manuscript we will add a dedicated subsection reporting Cohen's kappa for cue relevance and grounding judgments during human verification, plus error rates observed in the pipeline. This will clarify that the reported gaps primarily reflect model behavior rather than annotation artifacts.

revision: yes
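For readers unfamiliar with the agreement statistic named in the rebuttal, here is a minimal sketch of Cohen's kappa for two annotators labelling the same items. The function name and the toy labels are illustrative only; nothing here reflects the actual MM-OCEAN annotation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example: 1 = "cue is relevant", 0 = "cue is not relevant".
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than percent agreement when one label dominates.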

Circularity Check

0 steps flagged

New task, dataset, and empirical rates show no circular reduction


The paper defines a new task (Grounded Personality Reasoning), constructs MM-OCEAN via multi-agent pipeline plus human verification, and reports direct empirical statistics (51% ungrounded correct ratings, HR 0-33.5%) from three-tier evaluation of 27 MLLMs. These quantities are computed from the newly annotated data rather than obtained by fitting parameters to model outputs and relabeling them as predictions, or by any self-citation chain that supplies the central result. No equations, uniqueness theorems, or ansatzes are invoked that would make the reported gap equivalent to the input annotations by construction. The analysis is therefore self-contained as a standard benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the Big Five model plus the multi-agent pipeline with human verification yields trustworthy ground-truth behavioral observations; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: The Big Five personality traits can be reliably inferred from observable behavioral cues in short videos.
The entire GPR task and dataset construction rest on this premise.



Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 14 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis.Psychological bulletin, 111(2):256, 1992

Nalini Ambady and Robert Rosenthal. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis.Psychological bulletin, 111(2):256, 1992

work page 1992
[3]

Claude Haiku 4.5

Anthropic. Claude Haiku 4.5. https://www.anthropic.com/claude/haiku, 2025. Claude Haiku 4.5; accessed 2026-05-04

work page 2025
[4]

Claude Opus 4.6

Anthropic. Claude Opus 4.6. https://www.anthropic.com/claude/opus, 2025. Claude Opus 4.6; accessed 2026-05-04

work page 2025
[5]

Claude Sonnet 4.6

Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2025. Claude Sonnet 4.6; accessed 2026-05-04

work page 2025
[6]

The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

work page 2024
[7]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

The big five personality dimensions and job perfor- mance: a meta-analysis.Personnel psychology, 44(1):1–26, 1991

Murray R Barrick and Michael K Mount. The big five personality dimensions and job perfor- mance: a meta-analysis.Personnel psychology, 44(1):1–26, 1991

work page 1991
[9]

Towards interactive intelligence for digital humans

Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang, Caixin Kang, Kunhang Li, Haiyang Liu, Ruicong Liu, Yun Liu, et al. Towards interactive intelligence for digital humans. arXiv preprint arXiv:2512.13674, 2025

work page arXiv 2025
[10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

work page 2024
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

European Parliament Council and the. Regulation (eu) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence and amending regulations (ec) no 300/2008,(eu) no 167/2013,(eu) no 168/2013,(eu) 2018/858,(eu) 2018/1139 and (eu) 2019/2144 and directives 2014/90/eu,(eu) 2016/797 and (eu) 202...

work page 2024
[13]

Nonverbal leakage and clues to deception.Psychiatry, 32 (1):88–106, 1969

Paul Ekman and Wallace V Friesen. Nonverbal leakage and clues to deception.Psychiatry, 32 (1):88–106, 1969. 10

work page 1969
[14]

Modeling, recognizing, and explaining apparent personality from videos.IEEE Transactions on Affective Computing, 13(2):894–911, 2020

Hugo Jair Escalante, Heysem Kaya, Albert Ali Salah, Sergio Escalera, Ya ˘gmur Güçlütürk, Umut Güçlü, Xavier Baró, Isabelle Guyon, Julio CS Jacques Junior, Meysam Madadi, et al. Modeling, recognizing, and explaining apparent personality from videos.IEEE Transactions on Affective Computing, 13(2):894–911, 2020

work page 2020
[15]

Mmbench-video: A long-form multi-shot benchmark for holistic video understanding

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024

work page 2024
[16]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

work page 2025
[17]

On the accuracy of personality judgment: a realistic approach.Psychological review, 102(4):652, 1995

David C Funder. On the accuracy of personality judgment: a realistic approach.Psychological review, 102(4):652, 1995

work page 1995
[18]

Moody5: Personality-biased agents to enhance interactive storytelling in video games

Francesco Garavaglia, Renato Avellar Nobre, Laura Anna Ripamonti, Dario Maggiorini, and Davide Gadia. Moody5: Personality-biased agents to enhance interactive storytelling in video games. In2022 IEEE Conference on Games (CoG), pages 175–182. IEEE, 2022

work page 2022
[19]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

work page 2021
[20]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/, 2025. Gemma- 4-31B-it; accessed 2026-05-04

work page 2025
[22]

Gemini 3

Google DeepMind. Gemini 3. https://blog.google/products/gemini/gemini-3/,

work page
[23]

Multimodal large language model

work page
[24]

Gemini 3.1 pro

Google DeepMind. Gemini 3.1 pro. https://deepmind.google/models/gemini/, 2025. Gemini 3.1 Pro; accessed 2026-05-04

work page 2025
[25]

Gemini 3 flash

Google DeepMind. Gemini 3 flash. https://deepmind.google/models/gemini/, 2025. Gemini 3 Flash; accessed 2026-05-04

work page 2025
[26]

The distress analysis interview corpus of human and computer interviews

Jonathan Gratch, Ron Artstein, Gale M Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, et al. The distress analysis interview corpus of human and computer interviews. InLrec, volume 14, pages 3123–3128. Reykjavik, 2014

work page 2014
[27]

Deep impression: Audiovisual deep residual networks for multimodal apparent personality trait recognition

Ya˘gmur Güçlütürk, Umut Güçlü, Marcel AJ van Gerven, and Rob van Lier. Deep impression: Audiovisual deep residual networks for multimodal apparent personality trait recognition. In European conference on computer vision, pages 349–358. Springer, 2016

work page 2016
[28]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Step3-vl-10b technical report.arXiv preprint arXiv:2601.09668, 2026

Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, et al. Step3-vl-10b technical report.arXiv preprint arXiv:2601.09668, 2026

work page arXiv 2026
[30]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Paradigm shift to the integrative big five trait taxonomy.Handbook of personality: Theory and research, 3(2):114–158, 2008

Oliver P John, Laura P Naumann, and Christopher J Soto. Paradigm shift to the integrative big five trait taxonomy.Handbook of personality: Theory and research, 3(2):114–158, 2008

work page 2008
[32]

Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions.arXiv preprint arXiv:2511.16221, 2025

Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Ruicong Liu, and Yoichi Sato. Can mllms read the room? a multimodal benchmark for assessing deception in multi-party social interactions.arXiv preprint arXiv:2511.16221, 2025

work page arXiv 2025
[33]

Fantom: A benchmark for stress-testing machine theory of mind in interactions

Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. Fantom: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023

work page 2023
[34]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023
[35]

Can people feel happy and sad at the same time?Journal of personality and social psychology, 81(4):684, 2001

Jeff T Larsen, A Peter McGraw, and John T Cacioppo. Can people feel happy and sad at the same time?Journal of personality and social psychology, 81(4):684, 2001

work page 2001
[36]

Xiaomi mimo-vl-miloco technical report.arXiv preprint arXiv:2512.17436, 2025

Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, et al. Xiaomi mimo-vl-miloco technical report.arXiv preprint arXiv:2512.17436, 2025

work page arXiv 2025
[37]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

work page 2024
[38]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023
[39]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

work page 2024
[40]

Single-to-dual-view adaptation for egocentric 3d hand pose estimation

Ruicong Liu, Takehiko Ohkawa, Mingfang Zhang, and Yoichi Sato. Single-to-dual-view adaptation for egocentric 3d hand pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 677–686, 2024

work page 2024
[41]

SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting

Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, and Yoichi Sato. Sfhand: A streaming framework for language-guided 3d hand forecasting and embodied manipulation. arXiv preprint arXiv:2511.18127, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

work page 2024
[43]

Generalizing gaze estimation with outlier- guided collaborative adaptation

Yunfei Liu, Ruicong Liu, Haofei Wang, and Feng Lu. Generalizing gaze estimation with outlier- guided collaborative adaptation. InProceedings of the IEEE/CVF international conference on computer vision, pages 3835–3844, 2021

work page 2021
[44]

Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding.Advances in Neural Information Processing Systems, 36:46212–46244, 2023

work page 2023
[45]

Validation of the five-factor model of personality across instruments and observers.Journal of personality and social psychology, 52(1):81, 1987

Robert R McCrae and Paul T Costa. Validation of the five-factor model of personality across instruments and observers.Journal of personality and social psychology, 52(1):81, 1987

work page 1987
[46]

Llama 4 Maverick

Meta. Llama 4 Maverick. https://ai.meta.com/blog/ llama-4-multimodal-intelligence/ , 2025. Llama-4-Maverick (FP8 variant); ac- cessed 2026-05-04. 12

work page 2025
[47]

Auto- mated analysis and prediction of job interview performance.IEEE Transactions on Affective Computing, 9(2):191–204, 2016

Iftekhar Naim, Md Iftekhar Tanveer, Daniel Gildea, and Mohammed Ehsan Hoque. Auto- mated analysis and prediction of job interview performance.IEEE Transactions on Affective Computing, 9(2):191–204, 2016

work page 2016
[48]

OpenAI. GPT-5. https://openai.com/index/introducing-gpt-5/, 2025. Large lan- guage model

work page 2025
[49]

GPT-5.4.https://openai.com/, 2025

OpenAI. GPT-5.4.https://openai.com/, 2025. GPT-5.4; accessed 2026-05-04

work page 2025
[50]

GPT-5.4-mini

OpenAI. GPT-5.4-mini. https://openai.com/, 2025. GPT-5.4-mini; accessed 2026-05-04

work page 2025
[51]

GPT-5.5.https://openai.com/, 2025

OpenAI. GPT-5.5.https://openai.com/, 2025. GPT-5.5; accessed 2026-05-04

work page 2025
[52]

OpenAI. o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/ , 2025. Reasoning-capable language model

work page 2025
[53]

Multi-speaker attention alignment for multimodal social interaction.arXiv preprint arXiv:2511.17952, 2025

Liangyang Ouyang, Yifei Huang, Mingfang Zhang, Caixin Kang, Ryosuke Furuta, and Yoichi Sato. Multi-speaker attention alignment for multimodal social interaction.arXiv preprint arXiv:2511.17952, 2025

work page arXiv 2025
[54]

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

Liangyang Ouyang, Ruicong Liu, Caixin Kang, Yifei Huang, and Yoichi Sato. Socialdirector: Training-free social interaction control for multi-person video generation.arXiv preprint arXiv:2605.10079, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

Context-aware per- sonality inference in dyadic scenarios: Introducing the udiva dataset

Cristina Palmero, Javier Selva, Sorina Smeureanu, Julio Junior, Jacques CS, Albert Clapés, Alexa Moseguí, Zejian Zhang, David Gallardo, Georgina Guilera, et al. Context-aware per- sonality inference in dyadic scenarios: Introducing the udiva dataset. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1–12, 2021

work page 2021
[56]

Perception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36:42748–42761, 2023

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models.Advances in Neural Information Processing Systems, 36:42748–42761, 2023

work page 2023
[57]

Chalearn lap 2016: First round challenge on first impressions-dataset and results

Víctor Ponce-López, Baiyu Chen, Marc Oliu, Ciprian Corneanu, Albert Clapés, Isabelle Guyon, Xavier Baró, Hugo Jair Escalante, and Sergio Escalera. Chalearn lap 2016: First round challenge on first impressions-dataset and results. InEuropean conference on computer vision, pages 400–418. Springer, 2016

work page 2016
[58]

Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 527–536, 2019

work page 2019
[59]

Qwen Team. Qwen3.5. https://qwen.ai/blog?id=qwen3.5, 2025. Large language model

work page 2025
[60]

Counterfactual thinking.Psychological bulletin, 121(1):133, 1997

Neal J Roese. Counterfactual thinking.Psychological bulletin, 121(1):133, 1997

work page 1997
[61]

Transformer-based personality trait recognition enhanced by contextual augmentation.International Journal of Web Research, 9(1):1–24, 2026

Hossein Saberi and Reza Ravanmehr. Transformer-based personality trait recognition enhanced by contextual augmentation.International Journal of Web Research, 9(1):1–24, 2026

work page 2026
[62]

Emobench: Evaluating the emotional intelligence of large language models

Sahand Sabour, Siyang Liu, Zheyuan Zhang, June Liu, Jinfeng Zhou, Alvionna Sunaryo, Tatia Lee, Rada Mihalcea, and Minlie Huang. Emobench: Evaluating the emotional intelligence of large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5986–6004, 2024

[63] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019.

[64] Cheng Tang, Chao Tang, Steven Gong, Thomas M Kwok, and Yue Hu. Robot character generation and adaptive human-robot interaction with personality shaping. arXiv preprint arXiv:2503.15518, 2025.

[65] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[66] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[67] Qwen Team. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.

[68] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[69] Yufan Wu, Yinghui He, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng. Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10691–10706, 2023.

[70] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021.

[71] Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, and Yulan He. Opentom: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8593–8623, 2024.

[72] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[73] Wen-Jing Yan, Xiaobai Li, Su-Jing Wang, Guoying Zhao, Yong-Jin Liu, Yu-Hsin Chen, and Xiaolan Fu. Casme ii: An improved spontaneous micro-expression database and the baseline evaluation. PloS one, 9(1):e86041, 2014.

[74] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.

[75] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European conference on computer vision, pages 69–85. Springer, 2016.

[76] Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10668–10677, 2020.

Appendix Contents

Appendix Overview

The appendix is grouped thematically. Cross-references in...
