arxiv: 2604.16022 · v1 · submitted 2026-04-17 · 💻 cs.AI · cs.LG· cs.MA

Recognition: unknown

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Hikaru Shindo , Hanzhao Lin , Lukas Helff , Patrick Schramowski , Kristian Kersting

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:41 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.MA

keywords LLM agentsmulti-agent systemssocial reasoningembodied AIplanningdeception detectionbenchmarkAmong Us

0 comments

The pith

LLM agents achieve under 60 percent task accuracy and near-random deception detection in embodied multi-agent settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SocialGrid, an embodied multi-agent environment modeled on Among Us, to measure how well LLM agents handle planning, navigation, task execution, and social reasoning together. Even the largest open models complete tasks below 60 percent accuracy, often looping or failing basic movement, while deception detection stays at chance levels regardless of scale. An optional Planning Oracle separates navigation help from social evaluation, showing that assistance raises completion rates but leaves social reasoning as the unchanged bottleneck because agents use shallow cues instead of building evidence from behavior over time. This evaluation setup matters for turning LLMs into autonomous agents that must cooperate or compete in shared physical spaces.

Core claim

SocialGrid reveals that LLM agents show persistent shortfalls in both planning and social reasoning inside an embodied multi-agent environment, where task completion stays below 60 percent and deception detection remains near random chance even when navigation is assisted by a Planning Oracle, indicating reliance on superficial heuristics rather than accumulated behavioral evidence.

What carries the argument

SocialGrid, an embodied multi-agent environment inspired by Among Us that supplies an optional Planning Oracle to isolate social reasoning evaluation from planning and navigation deficits.

If this is right

Task completion stays low because agents enter repetitive loops or cannot handle basic obstacles in shared spaces.
Deception detection remains near random chance across all tested model scales, showing social reasoning does not improve with size alone.
Planning assistance raises overall completion rates but leaves social reasoning performance unchanged.
Automatic failure analysis and fine-grained metrics allow developers to pinpoint exact weaknesses in navigation versus social inference.
Elo-rated leaderboards from adversarial league play create a standardized competitive ranking for agent comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent systems may need dedicated components for tracking other agents' action histories and intentions rather than depending on single-turn heuristics.
The benchmark can serve as a testbed for training methods that jointly optimize embodied planning and social inference instead of treating them separately.
Results imply that purely text-based social evaluations may miss limitations that appear only when agents must act under physical constraints and real-time interactions.
The diagnostic tools in SocialGrid could guide creation of targeted training data focused on behavioral evidence accumulation in multi-agent settings.

Load-bearing premise

The Among Us-inspired environment together with the optional Planning Oracle isolates social reasoning deficits from planning and navigation problems without creating new behavioral confounds or task-specific biases.

What would settle it

A model that reliably detects deception at well above chance levels across varied scenarios in SocialGrid, even without the Planning Oracle, would falsify the claim of a persistent social reasoning bottleneck.

Figures

Figures reproduced from arXiv: 2604.16022 by Hanzhao Lin, Hikaru Shindo, Kristian Kersting, Lukas Helff, Patrick Schramowski.

**Figure 1.** Figure 1: SocialGrid Overview. Inspired by Among Us, SocialGrid is a controllable, embodied benchmark evaluating LLM agents in multi-agent, multi-objective environments. (Left) User Input: Enables systematic control of environmental complexity (e.g., map area, room count) and agent configuration. (Center) Environment: Agents operate under physical constraints; an optional Planning Oracle isolates social reasoning fr… view at source ↗

**Figure 2.** Figure 2: LLM agents struggle with spatial navigation in embodied settings. Comparison of crewmate performance on SocialGrid under low (no assistance) and high (with planning oracle) conditions. 7 crewmates per episode; 20 episodes per model; error bars show SD. Green values indicate absolute improvement from low to high. Left: Task performance (completion rate). Middle: Planning success rate (tasks reached). Right… view at source ↗

**Figure 4.** Figure 4: Trust calibration hovers near random baseline. Trust metrics for crewmate models facing GPT-OSS-120B impostor. Left: Brier Score measures how well trust predictions match ground truth (lower is better); most models hover near the random baseline (0.33, dashed). Right: Volatility measures how erratically trust changes between turns (lower is better); values around 0.33 indicate unstable assessments. 4 6 9 … view at source ↗

**Figure 3.** Figure 3: Detection accuracy reveals below-random performance across all models. Heatmap showing crewmate detection accuracy across 36 matchups (30 cross-model league + 6 self-play diagonal). All models perform near or below the random baseline (33%, shown in colorbar), averaging 29.9% detection accuracy. The consistent near-chance performance indicates that impostor detection remains challenging regardless of mode… view at source ↗

**Figure 6.** Figure 6: Failure analysis reveals model-specific patterns. Each radar chart shows the distribution of six failure modes for a given model, normalized to the global maximum across all models. The percentage under each model name indicates total failure coverage (sum of all failure mode fractions). Failure mode abbreviations: D.S. = Door Spam (repeatedly toggling doors), P.P. = Position Ping-Pong (oscillating between… view at source ↗

**Figure 7.** Figure 7: RL training progression. Task Performance and Planning Performance of Qwen3-4B across training steps. RL training does not yield significant improvements in either condition, with or without the planning assistant. and MultiAgentBench (Zhu et al., 2025) evaluate collaboration at scale (Cui et al., 2025; Wang et al., 2025a). In contrast, SocialGrid targets adversarial, partially observable settings where … view at source ↗

**Figure 8.** Figure 8: SocialGrid Environment. The main game view showing the grid-based environment with multiple rooms connected by doors. Agents are represented as colored circles. The red block indicates a dead body. Task locations are marked throughout the map, and the agent’s limited field of view creates partial observability. Trust Score Dynamics ( [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Trust Score Evolution. A temporal visualization showing how each agent’s trust beliefs evolve over an episode. Each line tracks one agent’s trust score toward other players over time, revealing patterns such as gradual suspicion accumulation, sudden trust drops after suspicious behavior, and the divergence between crewmate and impostor trust dynamics [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Voting Phase. The voting interface is constantly activated by the environment. Agents submit natural language statements defending themselves or accusing others, followed by votes to eliminate a suspected impostor. The interface displays each agent’s statement, vote target, and the final tally. B. Prompting Strategy Movement System Prompt. The system prompt for crewmate movement is structured to provide c… view at source ↗

**Figure 11.** Figure 11: Effect of Planning Assistant on Failure Modes. Per-model change in failure percentage when the planning assistant is enabled. Positive values (green) indicate failures reduced by the assistant; negative values (red) indicate failures that increase. The assistant dramatically reduces passive failures (NOOP deadlock, task fixation) across most models, while active navigation failures show model-dependent ef… view at source ↗

**Figure 12.** Figure 12: Performance vs. environmental complexity. Top row: performance vs. map area (fixed 2×2 layout). Bottom row: performance vs. number of rooms (fixed 10×10 room size). Columns show Task Performance (TP), Planning Performance (PP), Voting Accuracy, and Trust Brier Score (BS). Error bars indicate standard error across episodes. models specifically designed for complex inference tasks: DeepSeek-R1-70B and Phi4… view at source ↗

**Figure 13.** Figure 13: Head-to-Head Win Rate Heatmaps. Left: Overall win rate—each cell shows the win rate of the row model against the column model on the 10×10 2x2 pattern. Darker colors indicate higher win rates. The diagonal represents self-play and is set to 0.5. Right: Impostor win rate—complementary view showing impostor win rates for the same matchups. Colors range from gray (low) to red (high). The predominantly red co… view at source ↗

**Figure 14.** Figure 14: Impostors gained advantage by the navigation assistant. Comparison of winning score transition on SocialGrid . Terminal-filled variant (to reduce survivorship bias). After an episode ends at time τ , we fill the remaining trajectory with the terminal outcome: p˜crew(t) =    pcrew(t), t ≤ τ, 1, t > τ and crewmates win, 0, t > τ and impostors win. (13) The plotted curve is the mean of pcrew(t) (or p˜cr… view at source ↗

read the original abstract

As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SocialGrid adds a concrete new embodied multi-agent benchmark with an oracle to separate planning from social tasks, but the deception results look hard to interpret without checks on how the oracle changes what agents observe.

read the letter

Hey, the core contribution here is SocialGrid itself: an Among Us-style grid environment where multiple LLM agents handle tasks while navigating potential deception from others. They add an optional Planning Oracle to cut down on navigation and repetition failures so the social part can be tested more directly, plus automatic failure logging and an Elo-based leaderboard for ongoing comparisons. That setup is new enough to be worth having around for people working on collaborative agents, and the basic finding that even strong open models stay below 60% on task completion without help matches what shows up in other agent papers. The oracle does lift completion rates, which at least demonstrates the separation idea in practice. The automatic metrics and failure modes are a practical plus that could let developers iterate faster than with ad-hoc tests. The soft spot is the oracle's isolation claim. If the plans it supplies steer agents onto different paths, that changes the observations they get about impostor behavior, and nothing in the abstract or summary shows ablations for oracle variants, path effects, or pure social subtasks run without it. That leaves open whether the near-random deception detection is a real social limit or an artifact of altered information flow. The missing details on trial counts, variance, and exact metric definitions also make the numbers harder to rely on without the full methods. This is aimed at groups building and benchmarking multi-agent LLM systems in interactive settings. It has enough of a working environment and initial data to justify sending it to referees rather than rejecting outright, so the controls and stats can get tightened.

Referee Report

3 major / 3 minor

Summary. The paper introduces SocialGrid, an embodied multi-agent benchmark environment inspired by Among Us, to evaluate LLM agents on planning, task execution, and social reasoning (including deception detection). It reports that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents exhibiting repetitive behaviors and navigation failures. An optional Planning Oracle is provided to isolate social reasoning from planning deficits; while this improves task completion, deception detection remains near random chance across model scales. The work includes automatic failure analysis, fine-grained metrics, and an Elo-rated leaderboard from adversarial league play.

Significance. If the benchmark design and oracle successfully isolate social reasoning without introducing new confounds, the results would be significant for demonstrating persistent limitations in LLM agents' embodied social intelligence and for supplying a diagnostic platform with automatic analysis and a competitive leaderboard. The provision of reproducible metrics and adversarial evaluation setup are notable strengths that could support targeted agent improvements.

major comments (3)

[§3 (Planning Oracle)] §3 (Planning Oracle): The headline finding that social reasoning is the bottleneck (deception detection near random even with oracle assistance) depends on the oracle cleanly removing planning/navigation confounds. No ablations on oracle variants, no controls for path-dependent observation effects (e.g., how oracle paths alter encounters with impostor behaviors), and no non-oracle baselines on purely social subtasks are reported, leaving open whether low performance reflects genuine social deficits or interactions with the environment's information structure.
[§5 (Experiments and Results)] §5 (Experiments and Results): Specific performance claims (e.g., <60% task completion accuracy, near-random deception detection) are presented without details on number of trials per condition, run-to-run variance, statistical significance testing, or precise metric definitions, which prevents independent verification of the central empirical claims.
[§2 (Environment Design)] §2 (Environment Design): The Among Us-inspired grid setup is described at a high level, but the paper does not analyze or control for potential task-specific biases, such as how limited visibility or grid navigation mechanics might systematically affect the availability of deception cues independent of agent reasoning.

minor comments (3)

[Abstract and §4] Abstract and §4: The description of 'automatic failure analysis' would benefit from a concrete example or pseudocode in the main text to illustrate how failure modes are categorized.
[Related Work] Related Work: Additional citations to prior embodied multi-agent benchmarks (e.g., extensions of AI2-THOR or other social simulation environments) would better situate the novelty of SocialGrid.
[Figures] Figure captions: Some figures showing agent trajectories or failure cases lack scale bars or explicit legend explanations for the grid environment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, agreeing where revisions are needed to improve clarity and rigor, and providing our reasoning on the benchmark design.

read point-by-point responses

Referee: [§3 (Planning Oracle)] §3 (Planning Oracle): The headline finding that social reasoning is the bottleneck (deception detection near random even with oracle assistance) depends on the oracle cleanly removing planning/navigation confounds. No ablations on oracle variants, no controls for path-dependent observation effects (e.g., how oracle paths alter encounters with impostor behaviors), and no non-oracle baselines on purely social subtasks are reported, leaving open whether low performance reflects genuine social deficits or interactions with the environment's information structure.

Authors: We agree that further validation of the oracle would strengthen the isolation claim. In revision, we will add ablations comparing perfect oracle, noisy oracle, and no-oracle conditions, along with analysis of observation histories to control for path-dependent effects. We will also introduce non-oracle baselines by evaluating agents on isolated social subtasks (e.g., deception detection from fixed observation logs without navigation). These additions will help rule out confounds while preserving the current finding that social reasoning remains near chance even with planning assistance. revision: yes
Referee: [§5 (Experiments and Results)] §5 (Experiments and Results): Specific performance claims (e.g., <60% task completion accuracy, near-random deception detection) are presented without details on number of trials per condition, run-to-run variance, statistical significance testing, or precise metric definitions, which prevents independent verification of the central empirical claims.

Authors: We acknowledge the need for greater transparency on experimental details. The revised manuscript will specify the number of trials (50 episodes per model per condition), report means with standard deviations across runs, include statistical significance tests (e.g., paired t-tests), and provide explicit definitions for all metrics including task completion accuracy and deception detection rate. This will enable full independent verification of the reported results. revision: yes
Referee: [§2 (Environment Design)] §2 (Environment Design): The Among Us-inspired grid setup is described at a high level, but the paper does not analyze or control for potential task-specific biases, such as how limited visibility or grid navigation mechanics might systematically affect the availability of deception cues independent of agent reasoning.

Authors: The grid and visibility mechanics are core to creating embodied social scenarios analogous to Among Us. We will add a dedicated subsection in §2 that explicitly discusses these potential biases, explains the randomization of starting positions and impostor behaviors used to mitigate systematic effects, and analyzes how visibility constraints influence cue availability. This analysis will clarify that the benchmark intentionally tests integrated planning and social reasoning rather than isolating them artificially. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or fitted predictions

full rationale

The paper introduces SocialGrid as a new embodied multi-agent benchmark inspired by Among Us and reports empirical performance of LLM agents on task completion, planning, and social reasoning tasks. No mathematical derivation chain, equations, or first-principles results are claimed. Results rest on direct observation of agent behaviors in the environment, with the Planning Oracle presented as an optional experimental control rather than a fitted or self-referential component. No self-citations, ansatzes, or renamings of known results appear as load-bearing steps. The work is self-contained as an empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is the creation and initial use of a new benchmark environment rather than any derivation from axioms or fitting of parameters; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5485 in / 1112 out tokens · 27335 ms · 2026-05-10T08:41:38.949390+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 28 canonical work pages · 7 internal anchors

[1]

Phi-4 Technical Report

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., and et al. Phi-4 technical report. arXiv Preprint:2412.08905, 2024. URL https://arxiv.org/abs/2412.08905

work page internal anchor Pith review arXiv 2024
[2]

PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks

Chang, M., Chhablani, G., Clegg, A., Dallaire Cote, M., Desai, R., Hlavac, M., Karashchuk, V., Krantz, J., Mottaghi, R., Parashar, P., Patki, S., Prasad, I., Puig, X., Rai, A., Ramrakhya, R., Tran, D., Truong, J., Turner, J., Undersander, E., and Yang, T.-Y. PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks . In Proceedings of t...

2025
[3]

Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Chen, J., Lu, Y., Wang, X., Zeng, H., Huang, J., Gesi, J., Xu, Y., Yao, B., and Wang, D. Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation . arXiv Preprint:2507.21028, 2025. URL https://arxiv.org/abs/2507.21028

work page arXiv 2025
[4]

AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors

Chen, W., Su, Y., Zuo, J., Yang, C., Yuan, C., Chan, C.-M., Yu, H., Lu, Y., Hung, Y.-H., Qian, C., Qin, Y., Cong, X., Xie, R., Liu, Z., Sun, M., and Zhou, J. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors . In Proceedings of the International Conference on Learning Representations ( ICLR ) , 2024 a . URL https://procee...

2024
[5]

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song

Chen, Y., Ge, Y., Ge, Y., Ding, M., Li, B., Wang, R., Xu, R., Shan, Y., and Liu, X. EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning . arXiv Preprint:2312.06722, 2024 b . URL https://arxiv.org/abs/2312.06722

work page arXiv 2024
[6]

S., and Terry, J

Chevalier-Boisvert, M., Dai, B., Towers, M., Perez-Vicente, R., Willems, L., Lahlou, S., Pal, S., Castro, P. S., and Terry, J. K. Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks . In Proceedings of the Conference on Advances in Neural Information Processing Systems ( NeurIPS ) , 2023. URL https://pr...

2023
[7]

AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game

Chi, Y., Mao, L., and Tang, Z. AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game . arXiv Preprint:2407.16521, 2024. URL https://arxiv.org/abs/2407.16521

work page arXiv 2024
[8]

G., Qiu, H., and Stone, P

Cui, J., Tang, C., Holtz, J., Nguyen, J., Allievi, A. G., Qiu, H., and Stone, P. Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play . arXiv Preprint:2505.18334, 2025. URL https://arxiv.org/abs/2505.18334

work page arXiv 2025
[9]

Understanding Social Reasoning in Language Models with Language Models

Gandhi, K., Fraenken, J.-P., Gerstenberg, T., and Goodman, N. Understanding Social Reasoning in Language Models with Language Models . In Proceedings of the Conference on Advances in Neural Information Processing Systems ( NeurIPS ) , 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/2b9efb085d3829a2aadffab63ba206de-Paper-Datasets_and_B...

2023
[10]

Gemma 3 technical report

Gemma Team . Gemma 3 technical report. arXiv Preprint:2511.09768, 2025. URL https://arxiv.org/abs/2511.09768

work page arXiv 2025
[11]

and Garriga-Alonso, A

Golechha, S. and Garriga-Alonso, A. Among Us: A Sandbox for Measuring and Detecting Agentic Deception . In Proceedings of the Conference on Advances in Neural Information Processing Systems ( NeurIPS ) , 2025. URL https://arxiv.org/abs/2504.04072

work page arXiv 2025
[12]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., and et al. The llama 3 herd of models. arXiv Preprint:2407.21783, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

DeepSeek - R1 incentivizes reasoning in LLMs through reinforcement learning

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., and et al. DeepSeek - R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645, 2025. URL https://www.nature.com/articles/s41586-025-09422-z

2025
[14]

V., Wiest, O., and Zhang, X

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., and Zhang, X. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges . In Proceedings of International Joint Conference on Artificial Intelligence ( IJCAI ) , 2024. URL https://www.ijcai.org/proceedings/2024/890

2024
[15]

J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

Hu, E. J., yelong shen, Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-Rank Adaptation of Large Language Models . In Proceedings of the International Conference on Learning Representations ( ICLR ) , 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

2022
[16]

Tamas: Benchmarking adversarial risks in multi-agent llm systems,

Kavathekar, I., Jain, H., Rathod, A., Kumaraguru, P., and Ganu, T. TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems . arXiv Preprint:2511.05269, 2025. URL https://arxiv.org/abs/2511.05269

work page arXiv 2025
[17]

In: Proceedings of the 29th Symposium on Operating Systems Principles

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention . In Proceedings of the Symposium on Operating Systems Principles (SOSP) , 2023. URL https://doi.org/10.1145/3600006.3613165

work page doi:10.1145/3600006.3613165 2023
[18]

Theory of Mind for Multi-Agent Collaboration via Large Language Models

Li, H., Chong, Y., Stepputtis, S., Campbell, J., Hughes, D., Lewis, C., and Sycara, K. Theory of Mind for Multi-Agent Collaboration via Large Language Models . In Proceedings of the Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , 2023. URL https://aclanthology.org/2023.emnlp-main.13/

2023
[19]

GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments

Li, K., Tao, Y., Wen, X., Sun, Q., Gong, Z., Xu, C., Zhang, X., and Ji, T. GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments . arXiv Preprint:2505.24306, 2025. URL https://arxiv.org/abs/2505.24306

work page arXiv 2025
[20]

From text to tactic: Evaluating llms playing the game of avalon

Light, J., Cai, M., Shen, S., and Hu, Z. AvalonBench: Evaluating LLMs Playing the Game of Avalon . arXiv Preprint:2310.05036, 2023. URL https://arxiv.org/abs/2310.05036

work page arXiv 2023
[21]

From Text to Space: Mapping Abstract Spatial Models in Llms during a Grid-World Navigation Task

Martorell, N. From Text to Space: Mapping Abstract Spatial Models in Llms during a Grid-World Navigation Task . In Explainable Artificial Intelligence, 2025. URL https://link.springer.com/chapter/10.1007/978-3-032-08330-2_13

work page doi:10.1007/978-3-032-08330-2_13 2025
[22]

Human-level play in the game of Diplomacy by combining language models with strategic reasoning.Science, 378 (6624):1067–1074, 2022

Meta Fundamental AI Research Diplomacy Team (FAIR) , Bakhtin, A., Brown, N., Dinan, E., Farina, G., Flaherty, C., Fried, D., Goff, A., Gray, J., Hu, H., Jacob, A. P., Komeili, M., Konath, K., Kwon, M., Lerer, A., Lewis, M., Miller, A. H., Mitts, S., Renduchintala, A., Roller, S., Rowe, D., Shi, W., Spisak, J., Wei, A., Wu, D., Zhang, H., and Zijlstra, M. ...

work page doi:10.1126/science.ade9097 2022
[23]

Hervé Moulin

Mohammadi, M., Li, Y., Lo, J., and Yip, W. Evaluation and Benchmarking of LLM Agents: A survey . In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) , 2025. URL https://doi.org/10.1145/3711896.3736570

work page doi:10.1145/3711896.3736570 2025
[24]

Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

O'Gara, A. Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models . arXiv Preprint:2308.01404, 2023. URL https://arxiv.org/abs/2308.01404

work page arXiv 2023
[25]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI. GPT-OSS-120B & GPT-OSS-20B Model Card . arXiv Preprint:2508.10925, 2025. URL https://arxiv.org/abs/2508.10925

work page internal anchor Pith review arXiv 2025
[26]

How to Catch an AI Liar: Lie Detection in Black-Box Llms by Asking Unrelated Questions

Pacchiardi, L., Chan, A., Mindermann, S., Moscovitz, I., Pan, A., Gal, Y., Evans, O., and Brauner, J. How to Catch an AI Liar: Lie Detection in Black-Box Llms by Asking Unrelated Questions . In Proceedings of the International Conference on Learning Representations ( ICLR ) , 2024. URL https://proceedings.iclr.cc/paper_files/paper/2024/file/efe79ae16496a0...

2024
[27]

ISBN 9798400701320

Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative Agents: Interactive Simulacra of Human Behavior . In Proceedings of the Annual ACM Symposium on User Interface Software and Technologya (UIST) , 2023. URL https://doi.org/10.1145/3586183.3606763

work page doi:10.1145/3586183.3606763 2023
[28]

D., and Campbell, M

Riemer, M., Ashktorab, Z., Bouneffouf, D., Das, P., Liu, M., Weisz, J. D., and Campbell, M. Position: Theory of Mind Benchmarks are Broken for Large Language Models . In Proceedings of the International Conference on Machine Learning ( ICML ) , 2025. URL https://proceedings.mlr.press/v267/riemer25a.html

2025
[29]

Rustagi, Y

Sarkar, B., Xia, W., Liu, C. K., and Sadigh, D. Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning . In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) . International Foundation for Autonomous Agents and Multiagent Systems, 2025. URL https://dl.acm.org/doi/10.5555/3709347.3743819

work page doi:10.5555/3709347.3743819 2025
[30]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms . arXiv Preprint:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

H., Wu, J., Washington, C., Sadler, B

Song, C. H., Wu, J., Washington, C., Sadler, B. M., Chao, W.-L., and Su, Y. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models . In Proceedings of the IEEE/CVF International Conference on Computer Vision ( ICCV ) , 2023. URL https://openaccess.thecvf.com/content/ICCV2023/papers/Song_LLM-Planner_Few-Shot_Grounded_Plannin...

2023
[32]

Stogiannidis, I., McDonagh, S., and Tsaftaris, S. A. Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models . arXiv Preprint:2503.19707, 2025. URL https://arxiv.org/abs/2503.19707

work page arXiv 2025
[33]

Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Sun, H., Zhang, S., Niu, L., Ren, L., Xu, H., Fu, H., Zhao, F., Yuan, C., and Wang, X. Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents . In Proceedings of the Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , 2025. URL https://aclanthology.org/2025.emnlp-main.249/

2025
[34]

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction . MIT Press, 1998. URL http://www.incompleteideas.net/book/first/the-book.html

1998
[35]

PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change

Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S., and Kambhampati, S. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change . In Proceedings of the Conference on Advances in Neural Information Processing Systems ( NeurIPS ) , 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/7...

2023
[36]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models . Transactions on Machine Learning Research ( TMLR ) , 2024. URL https://openreview.net/pdf?id=ehfRiF0R3a

2024
[37]

MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs

Wang, Q., Wang, T., Tang, Z., Li, Q., Chen, N., Liang, J., and He, B. MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs . In Findings of the Association for Computational Linguistics (ACL) , 2025 a . URL https://aclanthology.org/2025.findings-acl.259/

2025
[38]

F., Cheng, T., and Xu, J

Wang, S., Subramanian, S., Sahni, M., Gone, P., Meng, L., Wang, X., Bertoli, N. F., Cheng, T., and Xu, J. Configurable multi-agent framework for scalable and realistic testing of llm-based agents . arXiv Preprint:2507.14705, 2025 b . URL https://arxiv.org/abs/2507.14705

work page arXiv 2025
[39]

PlanGenLLMs: A Modern Survey of LLM Planning Capabilities

Wei, H., Zhang, Z., He, S., Xia, T., Pan, S., and Liu, F. PlanGenLLMs: A Modern Survey of LLM Planning Capabilities . In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) , 2025. URL https://aclanthology.org/2025.acl-long.958/

2025
[40]

M., Kummerfeld, J

Wongkamjan, W., Gu, F., Wang, Y., Hermjakob, U., May, J., Stewart, B. M., Kummerfeld, J. K., Peskoff, D., and Boyd-Graber, J. L. More Victories, Less Cooperation: Assessing Cicero's Diplomacy Play . In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) , 2024. URL https://aclanthology.org/2024.acl-long.672/

2024
[41]

ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback

Wu, Q., Liu, W., Luan, J., and Wang, B. ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback . In Proceedings of the Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , 2024. URL https://aclanthology.org/2024.emnlp-main.1018/

2024
[42]

Science China Information Sciences , author =

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Qin, W., Zheng, Y., Qiu, X., Huang, X., Zhang, Q., and Gui, T. The Rise and Potential of Large Language Model Based Agents: A Survey . Science C...

work page doi:10.1007/s11432-024-4222-0 2025
[43]

TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Xie, J., Zhang, K., Chen, J., Zhu, T., Lou, R., Tian, Y., Xiao, Y., and Su, Y. TravelPlanner: A Benchmark for Real-World Planning with Language Agents . In Proceedings of the International Conference on Machine Learning ( ICML ) , 2024. URL https://proceedings.mlr.press/v235/xie24j.html

2024
[44]

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Xu, L., Hu, Z., Zhou, D., Ren, H., Dong, Z., Keutzer, K., Ng, S.-K., and Feng, J. MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration . In Proceedings of the Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , 2024 a . URL https://aclanthology.org/2024.emnlp-main.416/

2024
[45]

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Xu, P., Wang, S., Zhu, Y., Li, J., and Zhang, Y. SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition . arXiv Preprint:2511.21471, 2025 a . URL https://arxiv.org/abs/2511.21471

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Ex- ploring large language models for communica- tion games: An empirical study on werewolf

Xu, Y., Wang, S., Li, P., Luo, F., Wang, X., Liu, W., and Liu, Y. Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf . arXiv Preprint:2309.04658, 2024 b . URL https://arxiv.org/abs/2309.04658

work page arXiv 2024
[47]

SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

Xu, Z., Wang, Y., Huang, Y., Ye, J., Zhuang, H., Song, Z., Gao, L., Wang, C., Chen, Z., Zhou, Y., Li, S., Pan, W., Zhao, Y., Zhao, J., Zhang, X., and Chen, X. SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models . arXiv Preprint:2505.23713, 2025 b . URL https://arxiv.org/abs/2505.23713

work page arXiv 2025
[48]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., and et al. Qwen3 technical report. arXiv Preprint:2505.09388, 2025 a . URL https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

W., Han, R., Fei-Fei, L., and Xie, S

Yang, J., Yang, S., Gupta, A. W., Han, R., Fei-Fei, L., and Xie, S. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ( CVPR ) , 2025 b . URL https://openaccess.thecvf.com/content/CVPR2025/papers/Yang_Thinking_in_Space_How_Multimod...

2025
[50]

Survey on Evaluation of LLM-based Agents

Yehudai, A., Eden, L., Li, A., Uziel, G., Zhao, Y., Bar-Haim, R., Cohan, A., and Shmueli-Scheuer, M. Survey on Evaluation of LLM-based Agents . arXiv Preprint:2503.16416, 2025. URL https://arxiv.org/abs/2503.16416

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

Zhou, X., Zhu, H., Mathur, L., Zhang, R., Yu, H., Qi, Z., Morency, L.-P., Bisk, Y., Fried, D., Neubig, G., and Sap, M. SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents . In Proceedings of the International Conference on Learning Representations ( ICLR ) , 2024. URL https://proceedings.iclr.cc/paper_files/paper/2024/file/b3075b88e...

2024
[52]

MultiAgentBench : Evaluating the Collaboration and Competition of LLM agents

Zhu, K., Du, H., Hong, Z., Yang, X., Guo, S., Wang, Z., Wang, Z., Qian, C., Tang, R., Ji, H., and You, J. MultiAgentBench : Evaluating the Collaboration and Competition of LLM agents . In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) , 2025. URL https://aclanthology.org/2025.acl-long.421/

2025
[53]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...