AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
Pith reviewed 2026-05-13 21:46 UTC · model grok-4.3
The pith
Privacy is harder to protect in networks of LLM agents than in single-agent settings: coordination across domains and users creates ongoing leakage pressure even under explicit privacy instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In human-centered agentic social networks, where collaborative LLM agents serve individual users across domains while interacting with other users' agents, privacy preservation is fundamentally harder than in single-agent settings: cross-domain and cross-user coordination generates persistent leakage pressure even when agents receive explicit privacy instructions, and privacy instructions that teach abstraction of sensitive information paradoxically increase discussion of that information.
What carries the argument
AgentSocialBench, a benchmark of scenarios across seven categories of dyadic and multi-party interactions built on realistic user profiles with hierarchical sensitivity labels and directed social graphs, used to quantify leakage under different prompt conditions.
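The review does not reproduce the benchmark's schema. As a rough illustration only, a scenario record of the kind described here, with per-attribute hierarchical sensitivity labels, a directed social graph, and a category tag, might be laid out as below; every class and field name is an assumption for exposition, not the authors' actual format.

```python
# Illustrative sketch of an AgentSocialBench-style scenario record.
# All names and the four-level Sensitivity hierarchy are assumptions;
# the paper's actual schema is not specified in this review.
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical hierarchical sensitivity levels (higher = more sensitive)."""
    PUBLIC = 0
    CONTACTS_ONLY = 1
    CLOSE_TIES_ONLY = 2
    PRIVATE = 3


@dataclass
class UserProfile:
    user_id: str
    # Each profile attribute carries its own sensitivity label,
    # e.g. {"employer": ("Acme Corp", Sensitivity.CONTACTS_ONLY)}.
    attributes: dict[str, tuple[str, Sensitivity]]


@dataclass
class SocialGraph:
    # Directed edges (src, dst): information may flow from src to dst.
    edges: set[tuple[str, str]]

    def may_share(self, src: str, dst: str) -> bool:
        return (src, dst) in self.edges


@dataclass
class Scenario:
    category: str                      # one of the seven interaction categories
    participants: list[str]            # user_ids whose agents interact
    task: str                          # natural-language coordination task
    protected: list[tuple[str, str]]   # (user_id, attribute) pairs that must not leak
```

Leakage under a prompt condition can then be quantified as the fraction of scenarios in which any `protected` pair surfaces outside its permitted audience.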
If this is right
- Cross-domain and cross-user agent coordination produces measurable leakage even when agents are explicitly instructed to protect information (a minimal sketch of one way to operationalize such a leakage check follows this list).
- Privacy instructions that direct agents to abstract or generalize sensitive details cause increased discussion of those details.
- Current LLM agents lack built-in mechanisms that reliably prevent leakage in networked, multi-user settings.
- Safe real-world deployment of agent-mediated social coordination requires methods beyond prompt engineering.
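To make the first claim concrete, here is a deliberately naive sketch of a surface-form leakage check: flag a leak when a protected attribute value reaches a recipient the social graph does not authorize. The paper's actual evaluator is not described in this review (an LLM-based judge is more plausible), so the function and field names below are hypothetical.

```python
# Naive surface-form leakage scoring over an agent conversation transcript.
# Hypothetical structure: transcript entries are {"to": recipient_id, "text": str};
# real evaluation would also need semantic checks for paraphrased leaks.
import re


def leaked_attributes(transcript: list[dict], protected: dict[str, str],
                      authorized: set[str]) -> dict[str, bool]:
    """protected maps attribute name -> secret value; authorized holds
    recipient ids allowed to see those values."""
    hits = {name: False for name in protected}
    for msg in transcript:
        if msg["to"] in authorized:
            continue  # disclosure to an authorized party is not a leak
        for name, value in protected.items():
            if re.search(re.escape(value), msg["text"], re.IGNORECASE):
                hits[name] = True
    return hits


def leakage_rate(per_scenario_hits: list[dict[str, bool]]) -> float:
    """Fraction of scenarios in which at least one protected attribute leaked."""
    if not per_scenario_hits:
        return 0.0
    return sum(any(h.values()) for h in per_scenario_hits) / len(per_scenario_hits)
```

A substring matcher like this would miss the abstraction paradox by construction, since abstracted rephrasings share no surface form with the secret; that is one reason a semantic judge is the more plausible evaluator.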
Where Pith is reading between the lines
- Agent designs may need native privacy architectures that operate independently of user instructions rather than depending on prompt compliance.
- The same coordination pressures could appear in other multi-agent deployments such as enterprise workflow systems or shared personal assistants.
- Developers should validate benchmarks against dynamic, evolving social graphs drawn from actual usage data before scaling.
Load-bearing premise
The benchmark's constructed scenarios, hierarchical user profiles, and directed social graphs accurately represent real-world human-centered agentic social networks, and the tested LLM agents reflect typical deployed behavior.
What would settle it
Running the same leakage measurements on actual deployed multi-agent teams interacting with real users in live social platforms and comparing the observed leakage rates and abstraction-paradox frequency against the benchmark results.
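If such live data existed, one minimal first pass, assuming leakage can be reduced to per-scenario binary outcomes, would be a two-proportion test between benchmark and deployment rates. This is a hypothetical validation step, not something the paper performs; `proportions_ztest` is from statsmodels.

```python
# Hypothetical benchmark-vs-deployment comparison of leakage rates,
# assuming each setting yields a count of leaking scenarios out of n trials.
from statsmodels.stats.proportion import proportions_ztest


def benchmark_vs_live(leaks_bench: int, n_bench: int,
                      leaks_live: int, n_live: int) -> tuple[float, float]:
    """Returns (z, p); rate differences should also be reported as effect
    sizes before concluding the benchmark misestimates real-world leakage."""
    z, p = proportions_ztest([leaks_bench, leaks_live], [n_bench, n_live])
    return z, p
```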
Original abstract
With the rise of personalized, persistent LLM agent frameworks such as OpenClaw, human-centered agentic social networks in which teams of collaborative AI agents serve individual users in a social network across multiple domains are becoming a reality. This setting creates novel privacy challenges: agents must coordinate across domain boundaries, mediate between humans, and interact with other users' agents, all while protecting sensitive personal information. While prior work has evaluated multi-agent coordination and privacy preservation, the dynamics and privacy risks of human-centered agentic social networks remain unexplored. To this end, we introduce AgentSocialBench, the first benchmark to systematically evaluate privacy risk in this setting, comprising scenarios across seven categories spanning dyadic and multi-party interactions, grounded in realistic user profiles with hierarchical sensitivity labels and directed social graphs. Our experiments reveal that privacy in agentic social networks is fundamentally harder than in single-agent settings: (1) cross-domain and cross-user coordination creates persistent leakage pressure even when agents are explicitly instructed to protect information, (2) privacy instructions that teach agents how to abstract sensitive information paradoxically cause them to discuss it more (we call it abstraction paradox). These findings underscore that current LLM agents lack robust mechanisms for privacy preservation in human-centered agentic social networks, and that new approaches beyond prompt engineering are needed to make agent-mediated social coordination safe for real-world deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AgentSocialBench, the first benchmark for evaluating privacy risks in human-centered agentic social networks where teams of collaborative LLM agents serve individual users across multiple domains. It claims that privacy is fundamentally harder than in single-agent settings because cross-domain and cross-user coordination creates persistent leakage pressure even under explicit protection instructions, and because privacy instructions that teach agents to abstract sensitive information paradoxically cause them to discuss it more (the 'abstraction paradox'). The benchmark comprises scenarios across seven categories grounded in realistic user profiles with hierarchical sensitivity labels and directed social graphs; experiments on LLM agents demonstrate these risks and conclude that new approaches beyond prompt engineering are needed.
Significance. If the results hold, this work is significant for highlighting novel privacy challenges in emerging multi-agent LLM systems and for providing the first systematic benchmark in this setting. It offers empirical evidence of coordination-induced leakage and the abstraction paradox, which could inform safer designs for agent-mediated social coordination. The benchmark's grounding in constructed but realistic profiles and graphs is a strength for reproducibility and future extensions.
Major comments (2)
- [§3 and §4] §3 (Benchmark Construction) and §4 (Experiments): The central claims that cross-domain coordination creates persistent leakage and that the abstraction paradox is fundamental rest on the assumption that the synthetic scenarios, hierarchical sensitivity labels, and directed social graphs accurately represent real-world human-centered agentic social networks. The manuscript describes them as grounded in realistic profiles but provides no user studies, external validation against logs from systems like OpenClaw, or comparison against deployed systems; this validation is load-bearing for generalizing beyond the constructed setup.
- [§5] §5 (Results): The abstract reports clear findings on leakage rates and the abstraction paradox, but the manuscript lacks details on error bars, statistical significance tests, data exclusion rules, and full experimental protocols. This makes it difficult to assess the robustness of the observed effects and undermines verification of the claims that leakage persists even with explicit instructions.
Minor comments (2)
- [Abstract] Abstract: The phrase 'abstraction paradox' is introduced without a one-sentence definition; adding a brief parenthetical explanation would improve immediate clarity for readers.
- [§2] §2 (Related Work): Ensure comprehensive citation of prior multi-agent coordination and privacy benchmarks to better position the novelty of AgentSocialBench.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions made to strengthen the work.
Point-by-point responses
- Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Experiments): The central claims that cross-domain coordination creates persistent leakage and that the abstraction paradox is fundamental rest on the assumption that the synthetic scenarios, hierarchical sensitivity labels, and directed social graphs accurately represent real-world human-centered agentic social networks. The manuscript describes them as grounded in realistic profiles but provides no user studies, external validation against logs from systems like OpenClaw, or comparison against deployed systems; this validation is load-bearing for generalizing beyond the constructed setup.
Authors: We agree that the lack of user studies or direct validation against real deployment logs (e.g., from systems like OpenClaw) limits strong claims of generalizability to all real-world settings. Our scenarios were constructed by synthesizing patterns from public privacy literature, demographic statistics, and common social network structures to create controlled, reproducible test cases. In the revision, we have expanded §3 with a detailed 'Construction Methodology' subsection explaining the derivation of hierarchical sensitivity labels and directed graphs, and added an explicit 'Limitations and Future Work' paragraph acknowledging the synthetic nature and calling for future empirical validation with real user data. We cannot perform such validation in this revision due to lack of access to proprietary logs, but the benchmark remains valuable for isolating coordination-induced risks in a standardized environment. revision: partial
- Referee: [§5] §5 (Results): The abstract reports clear findings on leakage rates and the abstraction paradox, but the manuscript lacks details on error bars, statistical significance tests, data exclusion rules, and full experimental protocols. This makes it difficult to assess the robustness of the observed effects and undermines verification of the claims that leakage persists even with explicit instructions.
Authors: We appreciate this observation and have addressed it directly. The revised manuscript now includes error bars (standard deviation across 100 independent runs per condition) on all figures in §5. We report results of paired t-tests for key comparisons (e.g., leakage rates with vs. without privacy instructions), including p-values and effect sizes. Data exclusion rules are stated explicitly: no trials were excluded. Full experimental protocols—including all prompt templates, model versions (e.g., GPT-4, Claude-3), temperature settings, number of runs, and evaluation rubrics—have been added to a new Appendix C. These changes allow independent verification of the persistent leakage and abstraction paradox effects. revision: yes
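For concreteness, the comparison the rebuttal describes, a paired t-test over matched runs with and without privacy instructions plus an effect size, could be computed as in the sketch below. The 100-run structure comes from the rebuttal itself; the Cohen's d choice and array layout are assumptions, since the appendix is not reproduced here.

```python
# Sketch of the statistical comparison described in the rebuttal:
# paired t-test across independent runs, matched by run index between
# the with-instructions and without-instructions conditions.
import numpy as np
from scipy import stats


def compare_conditions(leak_with: np.ndarray, leak_without: np.ndarray) -> dict:
    """Each array holds one leakage rate per run (e.g., 100 runs)."""
    t, p = stats.ttest_rel(leak_with, leak_without)
    diff = leak_with - leak_without
    d = diff.mean() / diff.std(ddof=1)  # paired-samples Cohen's d
    return {
        "mean_with": float(leak_with.mean()),
        "sd_with": float(leak_with.std(ddof=1)),
        "mean_without": float(leak_without.mean()),
        "sd_without": float(leak_without.std(ddof=1)),
        "t": float(t), "p": float(p), "cohens_d": float(d),
    }
```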
Circularity Check
No circularity detected in empirical benchmark evaluation
Full rationale
The paper introduces AgentSocialBench as an empirical evaluation framework consisting of constructed scenarios, user profiles, hierarchical sensitivity labels, and directed social graphs. It reports experimental observations of LLM agent behaviors (persistent leakage under instructions, abstraction paradox) without any mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce claims to tautological inputs. All central findings are presented as direct results of running agents on the benchmark scenarios rather than being defined into existence by the setup itself.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Constructed user profiles with hierarchical sensitivity labels and directed social graphs provide a valid proxy for real human data sensitivity and social interactions in agentic networks.
Reference graph
Works this paper leans on
- [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [3] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. 2024. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
- [4] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations, 2024.
- [5] Giordano De Marzo and David Garcia. Collective behavior of AI agents: the case of Moltbook. arXiv preprint arXiv:2602.09270, 2026.
- [6] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [7] Faouzi El Yagoubi, Godwin Badu-Marfo, and Ranwa Al Mallah. AgentLeak: A full-stack benchmark for privacy leakage in multi-agent LLM systems. arXiv preprint arXiv:2602.11510, 2026.
- [8] Yi Feng, Chen Huang, Zhibo Man, Ryner Tan, Long P. Hoang, Shaoyang Xu, and Wenxuan Zhang. MoltNet: Understanding social behavior of AI agents in the agent-native MoltBook. arXiv preprint arXiv:2602.13458, 2026.
- [9] Chen Gao, Xiaochong Lan, Zhihong Lu, Jinzhu Mao, Jinghua Piao, Huandong Wang, Depeng Jin, and Yong Li. S^3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023.
- [10] Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, and Jianfeng Gao. MindAgent: Emergent gaming interaction. arXiv preprint arXiv:2309.09971, 2023.
- [11] Google. Announcing the Agent2Agent protocol (A2A). https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/, 2025. Open protocol for agent-to-agent communication.
- [12] Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S. Yu. The emerged security and privacy of LLM agent: A survey with case studies. ACM Computing Surveys, 2025. doi:10.1145/3773080.
- [13] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Wang, Zili Zhang, Steven Ka Shing Yau Wang, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024.
- [14] Yukun Jiang, Yage Zhang, Xinyue Shen, Michael Backes, and Yang Zhang. "Humans welcome to observe": A first look at the agent social network Moltbook. arXiv preprint arXiv:2602.10127, 2026.
- [15] Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, and William Yang Wang. MAGPIE: A benchmark for multi-AGent contextual PrIvacy Evaluation. arXiv preprint arXiv:2510.15186, 2025.
- [16] Haoran Li, Dadi Guo, Donghao Li, Wei Fan, Qi Hu, Xin Liu, Chunkit Chan, Duanyi Yao, Yuan Yao, and Yangqiu Song. PrivLM-Bench: A multi-level privacy evaluation benchmark for language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
- [18] Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yan Liu, Yue Zhao, and Xiyang Hu. Topology matters: Measuring memory leakage in multi-agent LLMs. arXiv preprint arXiv:2512.04668, 2025.
- [19] Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. Can LLMs keep a secret? Testing privacy implications of language models via contextual integrity theory. In International Conference on Learning Representations, 2024.
- [20] Niloofar Mireshghallah, Neal Mangaokar, Narine Kokhlikyan, Arman Zharmagambetov, Manzil Zaheer, Saeed Mahloujifar, and Kamalika Chaudhuri. CIMemories: A compositional benchmark for contextual integrity of persistent memory in LLMs. arXiv preprint arXiv:2511.14937, 2025.
- [21] Helen Nissenbaum. Privacy as contextual integrity. Washington Law Review, 79(1):119–158, 2004.
- [22] OpenClaw Contributors. OpenClaw: Open-source agent framework. https://github.com/anthropics/openclaw, 2025. Open-source agent framework for autonomous operation across messaging, calendars, and social media; 247K+ GitHub stars as of January 2026.
- [23] Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
- [24] Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, and Yong Li. AgentSociety: Large-scale simulation of LLM-driven generative agents advances understanding of human behaviors and society. arXiv preprint arXiv:2502.08691, 2025.
- [25] Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large language model-based multi-agent collaboration. In The Thirteenth International Conference on Learning Representations, 2025.
- [26] Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. PrivacyLens: Evaluating privacy norm awareness of language models in action. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2024.
- [27] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [28] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
- [29] Ziyi Yang, Zaibin Zhang, Zirui Zheng, Yuxian Jiang, Ziyue Gan, Zhiyu Wang, Zijian Ling, Jinsong Chen, Martz Ma, Bowen Dong, Prateek Gupta, Shuyue Hu, Zhenfei Yin, Guohao Li, Xu Jia, Lijun Wang, Bernard Ghanem, Huchuan Lu, Chaochao Lu, Wanli Ouyang, Yu Qiao, Philip Torr, and Jing Shao. OASIS: Open agent social interaction simulations with one million agents. arXiv preprint arXiv:2411.11581, 2024.
- [30] Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, and Kamalika Chaudhuri. AgentDAM: Privacy leakage evaluation for autonomous web agents. arXiv preprint arXiv:2503.09780, 2025.
- [31] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
- [32] Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. SOTOPIA: Interactive evaluation for social intelligence in language agents. In International Conference on Learning Representations, 2024.
- [33] Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, and Jiaxuan You. MultiAgentBench: Evaluating the collaboration and competition of LLM agents. arXiv preprint arXiv:2503.01935, 2025.