Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-Person Perspective. arXiv 2024.
4 Pith papers cite this work. Polarity classification is still indexing.
Citing papers

- Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
  Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.
- SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
  SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
- Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents
  Introduces ToM-PD task and ToM-BPD dataset plus TTBYS dual-knowledge framework, with Qwen3-8B outperforming GPT-5 on desire, belief, and strategy prediction.
- Generating Place-Based Compromises Between Two Points of View
  Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.