
arxiv: 2604.08727 · v1 · submitted 2026-04-09 · 💻 cs.CY

Communicate-Predict-Act: Evaluating Social Intelligence of Agents

Pith reviewed 2026-05-10 16:48 UTC · model grok-4.3

classification 💻 cs.CY
keywords social intelligence · LLM agents · multiplayer games · sociocognitive metrics · influence · adaptability · theory of mind · COMPACT protocol

The pith

LLM agents in mixed social games succeed more through influence, transparency, and adaptability than through theory-of-mind inference or deep planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a controllable multiplayer arena of cooperative and competitive games to evaluate social intelligence in large language model agents. It applies a Communicate-Predict-Act protocol to eight models ranging from 24B to 1T parameters and extracts detailed sociocognitive metrics from gameplay traces. These metrics prove consistent within each model and predict which agent will win pairwise matchups with an AUC of 0.82. Feature analysis reveals that influence over others, transparency of actions, and adaptability under conflict matter more for success than explicit mental-state modeling or long-horizon planning. The work therefore replaces a single scalar rating with a multidimensional, testable picture of what social intelligence requires.
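To make the headline number concrete: AUC is the probability that a randomly chosen positive example (here, a matchup the first agent won) receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. The scores and labels are made-up illustrative values, not data from the paper, and `auc_roc` is a hypothetical helper name.

```python
from itertools import product


def auc_roc(scores, labels):
    """AUC as P(score_pos > score_neg), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive score against every negative score.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Illustrative values only: predicted win probability and actual outcome.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc_roc(scores, labels))
```

In this toy input, 8 of the 9 positive-negative pairs are ordered correctly, so the result is 8/9, roughly 0.89. An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.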

Core claim

Sociocognitive metrics extracted from traces under the COMPACT protocol reliably predict agent advantage in game outcomes, yet feature importance shows influence, transparency, and adaptability outperform theory-of-mind inference and deep planning as drivers of success.

What carries the argument

The COMPACT (Communicate-Predict-Act) interaction protocol together with fine-grained extraction of sociocognitive metrics from gameplay traces.

Load-bearing premise

The COMPACT protocol and the metrics it extracts from traces validly measure social intelligence and generalize beyond the specific games studied.

What would settle it

Run the same models and metrics on a new set of social games or with human opponents and find that influence, transparency, and adaptability lose their predictive edge over theory-of-mind or planning measures.

Figures

Figures reproduced from arXiv: 2604.08727 by David Shoresh, Sarit Kraus, Yonatan Loewenstein.

Figure 1: Heatmaps showing probabilities of agents outperforming…
Figure 2: Bootstrap test for global ELO ratings. Violins represent…
Figure 3: Example illustrating how the LLM-judge rates influence…
Figure 5: Comparison between intra-agent and inter-agent Pear…
Figure 7: Feature Importances

From Section 5.1 (Conclusions): We illustrated the central importance of communication in differentiating the social intelligence of LLM agents (see ablation study). Communication is an essential element in human interactions and the main training data of LLMs. The most important socio-cognitive features we identified were influence and transparency, which are intrinsically linked to com…

As large language model (LLM) agents become more prevalent in real-world social settings, social intelligence will play an increasingly critical role. But social intelligence is still a poorly defined construct, for humans and artificial agents. We introduce a multiplayer arena of mixed cooperative and competitive social games to study LLM social intelligence. The controllability of LLM-based agents enables systematic evaluation, which also supports broader inferences about social intelligence per se. We evaluated eight diverse LLMs (24B to 1T parameters) using a Communicate-Predict-Act (COMPACT) interaction protocol and fine-grained probing of social dynamics. Elo-style ratings reveal consistent performance differences across models, but this scalar measure provides only a partial characterization of social intelligence. To address this limitation, we analyze gameplay traces to extract sociocognitive metrics capturing action prediction, communicative influence, strategic reasoning, and tradeoffs under conflicting interests. These sociocognitive metrics exhibit strong intramodel consistency, and they reliably predict pairwise agent advantage in game outcomes (AUC ROC = 0.82). Feature importance analysis indicates that, surprisingly, influence, transparency, and adaptability are more predictive of success than Theory of Mind inference or deep planning. Together, our results advance a testable, multidimensional conception of social intelligence and provide empirical insights into the capacities that underpin it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Communicate-Predict-Act (COMPACT) protocol in a multiplayer arena of mixed cooperative-competitive social games to evaluate social intelligence in eight diverse LLMs (24B to 1T parameters). It extracts sociocognitive metrics (action prediction, communicative influence, strategic reasoning, tradeoffs under conflicting interests) from gameplay traces, reports that these metrics show strong intramodel consistency and predict pairwise agent advantage with AUC ROC = 0.82, and uses feature importance to claim that influence, transparency, and adaptability are more predictive of success than Theory of Mind inference or deep planning, advancing a testable multidimensional conception of social intelligence.

Significance. If the metric extraction proves independent of outcome signals, the work offers a controllable, reproducible framework for probing social intelligence in LLM agents and supplies empirical evidence that communicative and adaptive capacities may outweigh pure ToM or planning in driving success. The systematic cross-model evaluation and Elo-style ratings provide a useful scalar baseline, while the multidimensional metrics move beyond single-score characterizations.
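For reference, the standard Elo update that a scalar baseline of this kind usually builds on can be sketched as follows. The K-factor of 32, the 400-point scale, and the starting ratings are conventional defaults assumed here for illustration; the paper describes its ratings only as "Elo-style", and its exact rating procedure is not specified in this summary.

```python
def expected_score(r_a, r_b, scale=400.0):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    The update is zero-sum: what A gains, B loses.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Two equally rated agents (assumed starting rating 1500); A wins.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
print(new_a, new_b)
```

With equal ratings the expected score is 0.5, so a win moves each rating by half the K-factor. The single number that results is exactly the kind of scalar summary the multidimensional metrics are meant to supplement.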

major comments (2)
  1. [Abstract and Results] Abstract and results on metric extraction: the sociocognitive metrics (influence, transparency, adaptability, ToM inference) are derived from the same gameplay traces used to determine game outcomes; without explicit definitions showing that quantities such as 'successful influence' or 'adaptability' are scored without reference to post-outcome action shifts or winner determination, the reported AUC ROC = 0.82 and feature-importance ranking risk partial circularity rather than demonstrating independent predictive power.
  2. [Feature importance analysis] Feature importance analysis: the claim that influence, transparency, and adaptability outrank ToM and deep planning requires specification of the importance method (e.g., permutation, SHAP), handling of metric correlations, and any statistical controls or cross-validation across game types and data splits; absent these, the 'surprising' ranking cannot be fully evaluated for robustness.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence definition of the COMPACT protocol to orient readers before the results are presented.
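Permutation importance, one of the methods named in major comment 2, measures how much a model's score drops when a single feature column is shuffled. The following is a minimal sketch, assuming a toy rule-based predictor and made-up data rather than the paper's classifier or metrics; all function and variable names are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy model: predicts a win iff feature 0 is positive;
# feature 1 is ignored entirely.
def toy_predict(row):
    return 1 if row[0] > 0 else 0


X = [[1.0, 0.3], [-1.0, 0.9], [2.0, -0.5], [-2.0, 0.1], [0.5, 0.7], [-0.5, -0.2]]
y = [1, 0, 1, 0, 1, 0]
imp = permutation_importance(toy_predict, X, y)
print(imp)
```

The ignored feature receives an importance of exactly zero, while the feature the model relies on receives a positive score. The referee's concern about correlated metrics applies directly to this method: when two features carry overlapping information, shuffling one leaves the other intact, so each can look less important than the pair jointly is.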

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. These points help us strengthen the clarity and rigor of our evaluation framework. We address each major comment below and will revise the paper accordingly.

  1. Referee: [Abstract and Results] Abstract and results on metric extraction: the sociocognitive metrics (influence, transparency, adaptability, ToM inference) are derived from the same gameplay traces used to determine game outcomes; without explicit definitions showing that quantities such as 'successful influence' or 'adaptability' are scored without reference to post-outcome action shifts or winner determination, the reported AUC ROC = 0.82 and feature-importance ranking risk partial circularity rather than demonstrating independent predictive power.

    Authors: We agree that explicit definitions are essential to demonstrate independence. In the COMPACT protocol, all sociocognitive metrics are extracted exclusively from the communication and prediction phases, which precede action execution and outcome resolution. Communicative influence is quantified as the shift in an opponent's predicted action vector following receipt of our agent's message, computed solely from the prediction logs. Adaptability is measured as the variance in an agent's own action predictions across rounds given observed history, without reference to payoffs or winners. Transparency is the accuracy with which other agents predict our agent's actions from its communications. Theory-of-mind inference and planning depth are similarly derived from pre-action traces. We will add formal mathematical definitions, pseudocode, and annotated example traces to the methods section and appendix in the revision to eliminate any ambiguity regarding independence from outcome signals. revision: yes

  2. Referee: [Feature importance analysis] Feature importance analysis: the claim that influence, transparency, and adaptability outrank ToM and deep planning requires specification of the importance method (e.g., permutation, SHAP), handling of metric correlations, and any statistical controls or cross-validation across game types and data splits; absent these, the 'surprising' ranking cannot be fully evaluated for robustness.

    Authors: We used permutation feature importance applied to a logistic regression classifier trained to predict pairwise win probabilities from the sociocognitive metrics. Importance scores were averaged over 5-fold cross-validation, with folds stratified by game type and random 80/20 data splits to ensure robustness. Metric correlations were addressed by computing pairwise Pearson coefficients and variance inflation factors; where VIF exceeded 5, we applied orthogonalization via principal component analysis on the correlated subset before importance ranking. We will expand the methods section to report the exact procedure, include the correlation matrix and VIF values, and present importance rankings with confidence intervals across game types and splits. This will allow readers to fully assess the relative predictive strength of influence, transparency, and adaptability over ToM inference and planning depth. revision: yes
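The verbal definitions in response 1 can be turned into a rough sketch. This follows only the simulated rebuttal's descriptions, not a confirmed procedure from the paper (the Figure 3 caption indicates that influence is rated by an LLM judge). The choice of total variation distance, the use of population variance, the scalar per-round prediction, and the log format are all assumptions made for illustration.

```python
from statistics import pvariance


def influence(pred_before, pred_after):
    """Shift in an opponent's predicted action distribution after a message.

    Assumed measure: total variation distance between the two distributions.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(pred_before, pred_after))


def adaptability(own_predictions):
    """Population variance of an agent's own per-round action predictions.

    Assumes each prediction is summarized as one number per round,
    for example the predicted probability of cooperating.
    """
    return pvariance(own_predictions)


def transparency(predicted_actions, actual_actions):
    """Fraction of rounds in which others predicted the agent's action correctly."""
    matches = sum(p == a for p, a in zip(predicted_actions, actual_actions))
    return matches / len(actual_actions)


# Hypothetical logs: distributions over (cooperate, defect), then action labels.
print(influence([0.5, 0.5], [0.8, 0.2]))
print(adaptability([0.2, 0.4, 0.6, 0.8]))
print(transparency(["C", "D", "C", "C"], ["C", "C", "C", "D"]))
```

None of the three functions takes payoffs or winners as input, which is the independence property the rebuttal asserts. Whether the paper's actual extraction has that property is what the promised formal definitions would need to show.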

Circularity Check

0 steps flagged

No significant circularity in sociocognitive metric prediction


The paper extracts sociocognitive metrics (influence, transparency, adaptability, ToM inference, strategic reasoning) from gameplay traces under the COMPACT protocol and reports that they predict pairwise agent advantage with AUC ROC = 0.82, while feature importance ranks influence, transparency, and adaptability above ToM inference and planning. This derivation is self-contained: the metrics are constructed from observable trace elements (actions, communications, predictions), the outcome prediction is a post-hoc statistical evaluation (AUC and importance analysis), and no step reduces the claimed result to a fitted quantity or self-citation by construction. No self-definitional, ansatz-smuggling, or uniqueness-imported steps appear; the empirical correlation between independently derived metrics and game results does not tautologically force the reported ranking or predictive power.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that social intelligence is measurable via performance and trace-derived metrics in the chosen game arena, with no free parameters or invented entities explicitly fitted in the abstract description.

axioms (1)
  • domain assumption Social intelligence in agents can be operationalized and tested through mixed cooperative-competitive multiplayer games using the Communicate-Predict-Act protocol.
    Invoked in the setup of the evaluation arena and metric extraction process.
invented entities (1)
  • Sociocognitive metrics (communicative influence, transparency, adaptability) no independent evidence
    purpose: To provide a multidimensional characterization of social intelligence beyond scalar Elo ratings.
    Derived from analysis of gameplay traces; no external independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5526 in / 1361 out tokens · 37420 ms · 2026-05-10T16:48:47.689708+00:00 · methodology

