Communicate-Predict-Act: Evaluating Social Intelligence of Agents
Pith reviewed 2026-05-10 16:48 UTC · model grok-4.3
The pith
LLM agents in mixed social games succeed more through influence, transparency, and adaptability than through theory-of-mind inference or deep planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sociocognitive metrics extracted from traces under the COMPACT protocol reliably predict agent advantage in game outcomes, yet feature importance shows influence, transparency, and adaptability outperform theory-of-mind inference and deep planning as drivers of success.
What carries the argument
The COMPACT (Communicate-Predict-Act) interaction protocol together with fine-grained extraction of sociocognitive metrics from gameplay traces.
Load-bearing premise
The COMPACT protocol and the metrics it extracts from traces validly measure social intelligence and generalize beyond the specific games studied.
What would settle it
Run the same models and metrics on a new set of social games or with human opponents and find that influence, transparency, and adaptability lose their predictive edge over theory-of-mind or planning measures.
Figures
read the original abstract
As large language model (LLM) agents become more prevalent in real world social settings, social intelligence will play an increasingly critical role. But social intelligence is still a poorly defined construct, for humans and artificial agents. We introduce a multiplayer arena of mixed cooperative and competitive social games to study LLM social intelligence. The controllability of LLM based agents enables systematic evaluation, which also supports broader inferences about social intelligence per se. We evaluated eight diverse LLMs (24B to 1T parameters) using a Communicate Predict Act (COMPACT) interaction protocol and fine grained probing of social dynamics. Elo style ratings reveal consistent performance differences across models, but this scalar measure provides only a partial characterization of social intelligence. To address this limitation, we analyze gameplay traces to extract sociocognitive metrics capturing action prediction, communicative influence, strategic reasoning, and tradeoffs under conflicting interests. These sociocognitive metrics exhibit strong intramodel consistency and they reliably predict pairwise agent advantage in game outcomes (AUC ROC = 0.82). Feature importance analysis indicates that surprisingly, influence, transparency, and adaptability are more predictive of success than Theory of Mind inference or deep planning. Together, our results advance a testable, multidimensional conception of social intelligence and provide empirical insights into the capacities that underpin it.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Communicate-Predict-Act (COMPACT) protocol in a multiplayer arena of mixed cooperative-competitive social games to evaluate social intelligence in eight diverse LLMs (24B to 1T parameters). It extracts sociocognitive metrics (action prediction, communicative influence, strategic reasoning, tradeoffs under conflicting interests) from gameplay traces, reports that these metrics show strong intramodel consistency and predict pairwise agent advantage with AUC ROC = 0.82, and uses feature importance to claim that influence, transparency, and adaptability are more predictive of success than Theory of Mind inference or deep planning, advancing a testable multidimensional conception of social intelligence.
Significance. If the metric extraction proves independent of outcome signals, the work offers a controllable, reproducible framework for probing social intelligence in LLM agents and supplies empirical evidence that communicative and adaptive capacities may outweigh pure ToM or planning in driving success. The systematic cross-model evaluation and Elo-style ratings provide a useful scalar baseline, while the multidimensional metrics move beyond single-score characterizations.
major comments (2)
- [Abstract and Results] Abstract and results on metric extraction: the sociocognitive metrics (influence, transparency, adaptability, ToM inference) are derived from the same gameplay traces used to determine game outcomes; without explicit definitions showing that quantities such as 'successful influence' or 'adaptability' are scored without reference to post-outcome action shifts or winner determination, the reported AUC ROC = 0.82 and feature-importance ranking risk partial circularity rather than demonstrating independent predictive power.
- [Feature importance analysis] Feature importance analysis: the claim that influence, transparency, and adaptability outrank ToM and deep planning requires specification of the importance method (e.g., permutation, SHAP), handling of metric correlations, and any statistical controls or cross-validation across game types and data splits; absent these, the 'surprising' ranking cannot be fully evaluated for robustness.
minor comments (1)
- [Abstract] The abstract would benefit from a one-sentence definition of the COMPACT protocol to orient readers before the results are presented.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. These points help us strengthen the clarity and rigor of our evaluation framework. We address each major comment below and will revise the paper accordingly.
read point-by-point responses
-
Referee: [Abstract and Results] Abstract and results on metric extraction: the sociocognitive metrics (influence, transparency, adaptability, ToM inference) are derived from the same gameplay traces used to determine game outcomes; without explicit definitions showing that quantities such as 'successful influence' or 'adaptability' are scored without reference to post-outcome action shifts or winner determination, the reported AUC ROC = 0.82 and feature-importance ranking risk partial circularity rather than demonstrating independent predictive power.
Authors: We agree that explicit definitions are essential to demonstrate independence. In the COMPACT protocol, all sociocognitive metrics are extracted exclusively from the communication and prediction phases, which precede action execution and outcome resolution. Communicative influence is quantified as the shift in an opponent's predicted action vector following receipt of our agent's message, computed solely from the prediction logs. Adaptability is measured as the variance in an agent's own action predictions across rounds given observed history, without reference to payoffs or winners. Transparency is the accuracy with which other agents predict our agent's actions from its communications. Theory-of-mind inference and planning depth are similarly derived from pre-action traces. We will add formal mathematical definitions, pseudocode, and annotated example traces to the methods section and appendix in the revision to eliminate any ambiguity regarding independence from outcome signals. revision: yes
-
Referee: [Feature importance analysis] Feature importance analysis: the claim that influence, transparency, and adaptability outrank ToM and deep planning requires specification of the importance method (e.g., permutation, SHAP), handling of metric correlations, and any statistical controls or cross-validation across game types and data splits; absent these, the 'surprising' ranking cannot be fully evaluated for robustness.
Authors: We used permutation feature importance applied to a logistic regression classifier trained to predict pairwise win probabilities from the sociocognitive metrics. Importance scores were averaged over 5-fold cross-validation, with folds stratified by game type and random 80/20 data splits to ensure robustness. Metric correlations were addressed by computing pairwise Pearson coefficients and variance inflation factors; where VIF exceeded 5, we applied orthogonalization via principal component analysis on the correlated subset before importance ranking. We will expand the methods section to report the exact procedure, include the correlation matrix and VIF values, and present importance rankings with confidence intervals across game types and splits. This will allow readers to fully assess the relative predictive strength of influence, transparency, and adaptability over ToM inference and planning depth. revision: yes
Circularity Check
No significant circularity in sociocognitive metric prediction
full rationale
The paper extracts sociocognitive metrics (influence, transparency, adaptability, ToM inference, strategic reasoning) from gameplay traces under the COMPACT protocol and reports that they predict pairwise agent advantage with AUC ROC = 0.82 while feature importance ranks influence/adaptability above ToM/planning. This derivation is self-contained: the metrics are constructed from observable trace elements (actions, communications, predictions), the outcome prediction is a post-hoc statistical evaluation (AUC and importance analysis), and no step reduces the claimed result to a fitted quantity or self-citation by construction. No self-definitional, ansatz-smuggling, or uniqueness-imported steps appear; the empirical correlation between independently derived metrics and game results does not tautologically force the reported ranking or predictive power.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Social intelligence in agents can be operationalized and tested through mixed cooperative-competitive multiplayer games using the Communicate-Predict-Act protocol.
invented entities (1)
-
Sociocognitive metrics (communicative influence, transparency, adaptability)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
[Avdiaj, 2022] Besnik Avdiaj. Size and decision-making: A systematic literature review on groups and teams.MAN- AGEMENT AND ECONOMICS REVIEW, 7:14–32, 03
work page 2022
-
[2]
Stanford University Press, Stanford,
[Axelrod, 1984] Robert Axelrod.The Evolution Of Cooper- ation. Stanford University Press, Stanford,
work page 1984
-
[3]
[Bakhtinet al., 2022 ] Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, An- drew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Ja- cob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexand...
work page 2022
-
[4]
Prentice-Hall, Englewood Cliffs, N.J.,
[Bandura, 1977] Albert Bandura.Social Learning Theory. Prentice-Hall, Englewood Cliffs, N.J.,
work page 1977
-
[5]
[Bhukya and Paul, 2023] Ramulu Bhukya and Justin Paul. Social influence research in consumer behavior: What we learned and what we need to learn? – a hybrid sys- tematic literature review.Journal of Business Research, 162:113870,
work page 2023
-
[6]
Camerer, Teck-Hua Ho, and Juin-Kuan Chong
[Camereret al., 2004 ] Colin F. Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games*.The Quarterly Journal of Economics, 119(3):861–898, 08
work page 2004
-
[7]
Cialdini.Influence: The psychol- ogy of persuasion
[Cialdini, 2021] Robert B. Cialdini.Influence: The psychol- ogy of persuasion. HarperCollins,
work page 2021
-
[8]
[Crawford and Sobel, 1982] Vincent P. Crawford and Joel Sobel. Strategic information transmission.Econometrica, 50(6):1431, Nov
work page 1982
-
[9]
[Danet al., 2025 ] Ohad Dan, Ori Plonsky, and Yonatan Loewenstein. Behavior engineering using quantitative re- inforcement learning models.Nature Communications, 16(1):4109,
work page 2025
-
[10]
[Elo, 1978] Arpad E Elo.The Rating of Chessplayers, Past and Present. B. T. Batsford Limited,
work page 1978
-
[11]
arXiv preprint arXiv:2402.01704 , year=
[Gempet al., 2024 ] Ian Gemp, Yoram Bachrach, Marc Lanctot, Roma Patel, Vibhavari Dasagi, Luke Marris, Georgios Piliouras, Siqi Liu, and Karl Tuyls. States as strings as strategies: Steering language models with game- theoretic solvers.CoRR, abs/2402.01704,
-
[12]
[Groveret al., 2020 ] Rachel L. Grover, Douglas W. Nangle, Michelle Buffie, and Laura A. Andrews. Chapter 1 - defin- ing social skills. In Douglas W. Nangle, Cynthia A. Erd- ley, and Rebecca A. Schwartz-Mette, editors,Social Skills Across the Life Span, pages 3–24. Academic Press,
work page 2020
-
[13]
[Hagendorff, 2024] Thilo Hagendorff. Deception abilities emerged in large language models.Proceedings of the Na- tional Academy of Sciences, 121(24),
work page 2024
-
[14]
[Halevy, 2016] N. Halevy. Chapter one - strategic thinking. volume 54 ofAdvances in Experimental Social Psychol- ogy, pages 1–66. Academic Press,
work page 2016
-
[15]
The many routes to the ubiquitous bradley-terry model
[Hamiltonet al., 2023 ] Ian Hamilton, Nick Tawn, and David Firth. The many routes to the ubiquitous bradley-terry model
work page 2023
-
[16]
[Happ´eet al., 2017 ] Francesca Happ ´e, Jennifer L. Cook, and Geoffrey Bird. The structure of social cognition: In(ter)dependence of sociocognitive processes.Annual Review of Psychology, 68(V olume 68, 2017):243–267,
work page 2017
-
[17]
Routledge, An Imprint Of The Taylor Et Francis Group, 7 edition,
[Hargie, 2022] Owen Hargie.Skilled interpersonal commu- nication : research, theory and practice. Routledge, An Imprint Of The Taylor Et Francis Group, 7 edition,
work page 2022
- [18]
- [19]
-
[20]
[Huanget al., 2025 ] Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenx- iang Jiao, Xing Wang, Zhaopeng Tu, and Michael R. Lyu. Competing large language models in multi-agent gaming environments. InProceedings of the Thirteenth Interna- tional Conference on Learning Representations (ICLR),
work page 2025
-
[21]
[Kelly, 2003] Anthony Kelly.Two-person mixed-motive games of strategy, page 98–134. Cambridge University Press,
work page 2003
-
[22]
Kihlstrom and Nancy Cantor.Social Intelligence, page 564–581
[Kihlstrom and Cantor, 2011] John F. Kihlstrom and Nancy Cantor.Social Intelligence, page 564–581. Cambridge Handbooks in Psychology. Cambridge University Press,
work page 2011
-
[23]
[Kraus and Lehmann, 1995] Sarit Kraus and Daniel Lehmann. Designing and building a negotiating au- tomated agent (diplomacy).Computational Intelligence, 11(1):132–171,
work page 1995
-
[24]
[Kwonet al., 2024 ] Deuksin Kwon, Emily Weiss, Tara Kul- shrestha, Kushal Chawla, Gale Lucas, and Jonathan Gratch. Are llms effective negotiators? systematic evalua- tion of the multifaceted capabilities of llms in negotiation dialogues. InFindings of the Association for Computa- tional Linguistics: EMNLP 2024, pages 5391–5413, Mi- ami, Florida, USA,
work page 2024
-
[25]
Association for Computational Linguistics. [Leyton-Brown and Shoham, 2008] Kevin Leyton-Brown and Yoav Shoham.Coalitional Game Theory, pages 69–77. Springer International Publishing, Cham,
work page 2008
-
[26]
Measuring and benchmarking large language models’ capabilities to generate persuasive lan- guage
[Pauliet al., 2025 ] Amalie Brogaard Pauli, Isabelle Augen- stein, and Ira Assent. Measuring and benchmarking large language models’ capabilities to generate persuasive lan- guage. In Luis Chiruzzo, Alan Ritter, and Lu Wang, ed- itors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computa- tional Linguisti...
work page 2025
-
[27]
Association for Computational Linguistics. [Piattiet al., 2024 ] Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Sch ¨olkopf, Mrinmaya Sachan, and Rada Mihalcea. Cooperate or collapse: Emergence of sustainable cooperation in a society of llm agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Adv...
work page 2024
-
[28]
[Rahim, 1983] M. Rahim. Measurement of organizational conflict.Journal of General Psychology, 109:189–199, 10
work page 1983
-
[29]
[Rossi, 2018] Richard J Rossi.Mathematical Statistics. John Wiley & Sons, Jun
work page 2018
-
[30]
Social IQa: Com- monsense reasoning about social interactions
[Sapet al., 2019 ] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Com- monsense reasoning about social interactions. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Meth- ods in Natural Language Processing and the 9th Interna- tional Joint Conference on...
work page 2019
-
[31]
[Silveret al., 2021 ] David Silver, Satinder Singh, Doina Pre- cup, and Richard S
Association for Computational Linguistics. [Silveret al., 2021 ] David Silver, Satinder Singh, Doina Pre- cup, and Richard S. Sutton. Reward is enough.Artificial Intelligence, 299:103535,
work page 2021
-
[32]
[Smithet al., 2025 ] Chandler Smith, Marwa Abdulhai, Man- fred Diaz, Marko Tesic, Rakshit Trivedi, Alexander Sasha Vezhnevets, Lewis Hammond, Jesse Clifton, Minsuk Chang, Edgar A. Du´e˜nez-Guzm´an, John P. Agapiou, Jayd Matyas, Danny Karmon, Akash Kundu, Aliaksei Kor- shuk, Ananya Ananya, Arrasy Rahman, Avinaash Anand Kulandaivel, Bain McHale, Beining Zha...
work page 2025
-
[33]
[Stolz, 2023] J¨org Stolz. The theory of social games: outline of a general theory for the social sciences.Humanities & social sciences communications, 10(1), Jun
work page 2023
-
[34]
[Strachanet al., 2024 ] James W. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, and et al. Testing theory of mind in large language models and humans.Nature Human Be- haviour, 8(7):1285–1295, May
work page 2024
-
[35]
Sotopia: Interactive evaluation for social intelligence in language agents,
[Zhouet al., 2024 ] Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis- Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for social intelligence in language agents,
work page 2024
-
[36]
Mixing search strategies for multi-player games
[Zuckermanet al., 2009 ] Inon Zuckerman, Ariel Felner, and Sarit Kraus. Mixing search strategies for multi-player games. InIJCAI, volume 9, pages 646–652,
work page 2009
-
[37]
[ ¨Ostlinget al., 2011 ] Robert ¨Ostling, Joseph Tao-yi Wang, Eileen Y . Chou, and Colin F. Camerer. Testing game the- ory in the field: Swedish lupi lottery games.American Economic Journal: Microeconomics, 3(3):1–33, August 2011
work page 2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.