The SOTOPIA-TOM benchmark reveals that even GPT-5 scores only 62% on information management in multi-agent interactions, while Theory-of-Mind prompting cuts privacy violations and raises overall scores.
Of the 160 generated scenarios, human review identified 4 errors: two were incorrect knowledge-domain-to-role mappings and two contained unreachable agent goals.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.MA (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
- SOTOPIA-TOM: Evaluating Information Management in Multi-Agent Interaction with Theory of Mind