SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Pith reviewed 2026-05-10 08:41 UTC · model grok-4.3
The pith
LLM agents achieve under 60 percent task accuracy and near-random deception detection in embodied multi-agent settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SocialGrid reveals persistent shortfalls in both planning and social reasoning for LLM agents in an embodied multi-agent environment: task completion stays below 60 percent, and deception detection remains near random chance even when navigation is assisted by a Planning Oracle. This points to reliance on superficial heuristics rather than accumulated behavioral evidence.
What carries the argument
SocialGrid, an embodied multi-agent environment inspired by Among Us that supplies an optional Planning Oracle to isolate social reasoning evaluation from planning and navigation deficits.
If this is right
- Task completion stays low because agents enter repetitive loops or cannot handle basic obstacles in shared spaces.
- Deception detection remains near random chance across all tested model scales, showing social reasoning does not improve with size alone.
- Planning assistance raises overall completion rates but leaves social reasoning performance unchanged.
- Automatic failure analysis and fine-grained metrics allow developers to pinpoint exact weaknesses in navigation versus social inference.
- Elo-rated leaderboards from adversarial league play create a standardized competitive ranking for agent comparisons.
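The paper does not specify which Elo variant the league uses, so as a point of reference, a minimal sketch of the standard Elo update applied after each adversarial match might look like this (the k-factor and base rating are conventional defaults, not values from the paper):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update for one match.

    score_a is 1.0 for a win by agent A, 0.5 for a draw, 0.0 for a loss.
    Returns the updated (r_a, r_b) pair; total rating mass is conserved.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated agents: the winner gains exactly k/2 points.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Running this update over every game in the league and sorting by final rating yields the kind of standardized ranking the benchmark proposes.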
Where Pith is reading between the lines
- Future agent systems may need dedicated components for tracking other agents' action histories and intentions rather than depending on single-turn heuristics.
- The benchmark can serve as a testbed for training methods that jointly optimize embodied planning and social inference instead of treating them separately.
- Results imply that purely text-based social evaluations may miss limitations that appear only when agents must act under physical constraints and real-time interactions.
- The diagnostic tools in SocialGrid could guide creation of targeted training data focused on behavioral evidence accumulation in multi-agent settings.
Load-bearing premise
The Among Us-inspired environment together with the optional Planning Oracle isolates social reasoning deficits from planning and navigation problems without creating new behavioral confounds or task-specific biases.
What would settle it
A model that reliably detects deception at well above chance levels across varied scenarios in SocialGrid, even without the Planning Oracle, would falsify the claim of a persistent social reasoning bottleneck.
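"Well above chance" needs a concrete statistical criterion. A minimal sketch, assuming deception calls are scored as binary correct/incorrect and chance is 50 percent (both assumptions, since the paper's metric definitions are not reproduced here), is an exact one-sided binomial test:

```python
from math import comb

def binom_p_value(successes, trials, p_chance=0.5):
    """One-sided exact binomial p-value: P(X >= successes | p = p_chance)."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
        for k in range(successes, trials + 1)
    )

# Example: 60 correct deception calls out of 100 against a 50% baseline.
p = binom_p_value(60, 100)
print(round(p, 4))
```

A model whose p-value stays small across varied scenarios, without the Planning Oracle, would meet the falsification bar described above.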
Original abstract
As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SocialGrid, an embodied multi-agent benchmark environment inspired by Among Us, to evaluate LLM agents on planning, task execution, and social reasoning (including deception detection). It reports that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents exhibiting repetitive behaviors and navigation failures. An optional Planning Oracle is provided to isolate social reasoning from planning deficits; while this improves task completion, deception detection remains near random chance across model scales. The work includes automatic failure analysis, fine-grained metrics, and an Elo-rated leaderboard from adversarial league play.
Significance. If the benchmark design and oracle successfully isolate social reasoning without introducing new confounds, the results would be significant for demonstrating persistent limitations in LLM agents' embodied social intelligence and for supplying a diagnostic platform with automatic analysis and a competitive leaderboard. The reproducible metrics and adversarial evaluation setup are notable strengths that could support targeted agent improvements.
major comments (3)
- §3 (Planning Oracle): The headline finding that social reasoning is the bottleneck (deception detection near random even with oracle assistance) depends on the oracle cleanly removing planning/navigation confounds. No ablations on oracle variants, no controls for path-dependent observation effects (e.g., how oracle paths alter encounters with impostor behaviors), and no non-oracle baselines on purely social subtasks are reported, leaving open whether low performance reflects genuine social deficits or interactions with the environment's information structure.
- §5 (Experiments and Results): Specific performance claims (e.g., <60% task completion accuracy, near-random deception detection) are presented without details on number of trials per condition, run-to-run variance, statistical significance testing, or precise metric definitions, which prevents independent verification of the central empirical claims.
- §2 (Environment Design): The Among Us-inspired grid setup is described at a high level, but the paper does not analyze or control for potential task-specific biases, such as how limited visibility or grid navigation mechanics might systematically affect the availability of deception cues independent of agent reasoning.
minor comments (3)
- Abstract and §4: The description of 'automatic failure analysis' would benefit from a concrete example or pseudocode in the main text to illustrate how failure modes are categorized.
- Related Work: Additional citations to prior embodied multi-agent benchmarks (e.g., extensions of AI2-THOR or other social simulation environments) would better situate the novelty of SocialGrid.
- Figure captions: Some figures showing agent trajectories or failure cases lack scale bars or explicit legend explanations for the grid environment.
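On the automatic-failure-analysis point, a hypothetical rule-based categorizer illustrates what such pseudocode might look like. The category names and episode-log fields below are invented for illustration and are not taken from the paper:

```python
def categorize_failure(episode):
    """Map a failed episode log to a coarse failure category.

    `episode` is assumed to be a list of step dicts with keys
    'action', 'position', and an optional 'blocked' flag; this
    schema is an illustrative assumption, not SocialGrid's API.
    """
    actions = [step["action"] for step in episode]
    positions = [step["position"] for step in episode]

    # Repetitive loop: the agent keeps revisiting a tiny cycle of cells.
    if len(positions) >= 8 and len(set(positions[-8:])) <= 2:
        return "repetitive_loop"
    # Navigation failure: most moves were rejected by the environment.
    blocked = sum(1 for step in episode if step.get("blocked", False))
    if blocked > len(episode) // 2:
        return "navigation_obstacle"
    # Otherwise attribute the failure to the social layer (e.g. a bad vote).
    if actions and actions[-1] == "vote":
        return "social_misjudgment"
    return "other"
```

Even a crude ruleset like this makes the navigation-versus-social-inference split auditable, which is the diagnostic value the benchmark claims.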
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below, agreeing where revisions are needed to improve clarity and rigor, and providing our reasoning on the benchmark design.
Point-by-point responses
- Referee, §3 (Planning Oracle): The headline finding that social reasoning is the bottleneck (deception detection near random even with oracle assistance) depends on the oracle cleanly removing planning/navigation confounds. No ablations on oracle variants, no controls for path-dependent observation effects (e.g., how oracle paths alter encounters with impostor behaviors), and no non-oracle baselines on purely social subtasks are reported, leaving open whether low performance reflects genuine social deficits or interactions with the environment's information structure.
Authors: We agree that further validation of the oracle would strengthen the isolation claim. In revision, we will add ablations comparing perfect oracle, noisy oracle, and no-oracle conditions, along with analysis of observation histories to control for path-dependent effects. We will also introduce non-oracle baselines by evaluating agents on isolated social subtasks (e.g., deception detection from fixed observation logs without navigation). These additions will help rule out confounds while preserving the current finding that social reasoning remains near chance even with planning assistance. revision: yes
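The proposed perfect/noisy/no-oracle ablation could be realized as a thin wrapper around a planner. The planner interface below (a callable returning a list of moves) is an assumption for the sketch, not the paper's API:

```python
import random

def make_noisy_oracle(plan_path, error_rate,
                      moves=("up", "down", "left", "right"), seed=0):
    """Wrap a perfect planner so that each suggested move is replaced
    by a uniformly random move with probability `error_rate`, giving
    the 'noisy oracle' ablation condition. error_rate=0.0 reproduces
    the perfect oracle; error_rate=1.0 approximates no useful oracle."""
    rng = random.Random(seed)

    def noisy_plan(state, goal):
        path = plan_path(state, goal)
        return [
            rng.choice(moves) if rng.random() < error_rate else step
            for step in path
        ]

    return noisy_plan
```

Sweeping `error_rate` between 0 and 1 would trace how social-reasoning scores respond to planning quality, which is exactly the confound check the referee asks for.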
- Referee, §5 (Experiments and Results): Specific performance claims (e.g., <60% task completion accuracy, near-random deception detection) are presented without details on number of trials per condition, run-to-run variance, statistical significance testing, or precise metric definitions, which prevents independent verification of the central empirical claims.
Authors: We acknowledge the need for greater transparency on experimental details. The revised manuscript will specify the number of trials (50 episodes per model per condition), report means with standard deviations across runs, include statistical significance tests (e.g., paired t-tests), and provide explicit definitions for all metrics including task completion accuracy and deception detection rate. This will enable full independent verification of the reported results. revision: yes
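The promised reporting (means with standard deviations and paired t-tests over matched runs) can be sketched in a few lines; the scores below are invented placeholders, not results from the paper:

```python
from math import sqrt

def mean_std(xs):
    """Mean and sample standard deviation (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, sqrt(var)

def paired_t(xs, ys):
    """Paired t statistic for matched samples (df = n - 1)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    m, s = mean_std(diffs)
    return m / (s / sqrt(len(diffs)))

# Placeholder task-completion rates for the same seeds, with and
# without the Planning Oracle (illustrative values only):
oracle = [0.71, 0.66, 0.74, 0.69, 0.70]
no_oracle = [0.55, 0.52, 0.58, 0.51, 0.56]
print(mean_std(oracle), paired_t(oracle, no_oracle))
```

Reporting the t statistic alongside per-condition mean and standard deviation, as the authors propose, is enough for readers to recompute significance from the released logs.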
- Referee, §2 (Environment Design): The Among Us-inspired grid setup is described at a high level, but the paper does not analyze or control for potential task-specific biases, such as how limited visibility or grid navigation mechanics might systematically affect the availability of deception cues independent of agent reasoning.
Authors: The grid and visibility mechanics are core to creating embodied social scenarios analogous to Among Us. We will add a dedicated subsection in §2 that explicitly discusses these potential biases, explains the randomization of starting positions and impostor behaviors used to mitigate systematic effects, and analyzes how visibility constraints influence cue availability. This analysis will clarify that the benchmark intentionally tests integrated planning and social reasoning rather than isolating them artificially. revision: yes
Circularity Check
No circularity: purely empirical benchmark with no derivations or fitted predictions
Full rationale
The paper introduces SocialGrid as a new embodied multi-agent benchmark inspired by Among Us and reports empirical performance of LLM agents on task completion, planning, and social reasoning tasks. No mathematical derivation chain, equations, or first-principles results are claimed. Results rest on direct observation of agent behaviors in the environment, with the Planning Oracle presented as an optional experimental control rather than a fitted or self-referential component. No self-citations, ansatzes, or renamings of known results appear as load-bearing steps. The work is self-contained as an empirical evaluation.