SCENE: Recognizing Social Norms and Sanctioning in Group Chats

Maksymilian Bilski; Mateusz Jacniacki

arxiv: 2605.07823 · v1 · submitted 2026-05-08 · 💻 cs.CL

Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3

keywords: social norms, LLM agents, group chats, implicit norms, social sanctioning, benchmark, multi-party interaction, behavioral adaptation

The pith

The SCENE benchmark shows that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit group-chat norms significantly better than the evaluated open-weight models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SCENE, a benchmark that generates multi-party chat scenarios built around a hidden norm followed by scripted personas. The tested LLM agent is given chances to violate the norm, after which peers apply social sanctions, and the agent's subsequent behavior is scored for two kinds of adaptation: responding to sanctions and learning the norm from observed peer conduct. Results across six models indicate that the two closed frontier systems adjust their behavior more effectively than the open-weight models evaluated. This line of testing matters because LLMs are moving into roles as conversational agents in social spaces where unstated rules govern participation. The work shifts evaluation from static prompts toward dynamic, sanction-driven interactions.

Core claim

SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. It defines behavioral metrics for responsiveness to negative sanctioning and for adapting the norm from peers' behavior. On this benchmark, Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models.
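The interaction loop described above can be pictured with a small Python sketch. It is an illustration only, not the authors' pipeline: the names `Scenario`, `run_scenario`, and `toy_agent` are hypothetical, and the "no all-caps messages" norm is a toy stand-in chosen because it can be checked with a one-line rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (speaker, text)


@dataclass
class Scenario:
    """Hypothetical container for one scripted chat (not the paper's schema)."""
    violates: Callable[[str], bool]  # toy detector for the hidden norm
    peer_script: List[str]           # scripted peer turns, each an opportunity
    sanction_text: str               # what a peer says after a breach


def run_scenario(scenario: Scenario,
                 agent: Callable[[List[Message]], str]):
    """Play the script, let the subject reply, and sanction each breach."""
    transcript: List[Message] = []
    violations: List[bool] = []
    for peer_msg in scenario.peer_script:
        transcript.append(("peer", peer_msg))
        reply = agent(transcript)
        transcript.append(("subject", reply))
        breached = scenario.violates(reply)
        violations.append(breached)
        if breached:
            # Scripted peers sanction the breach before the chat moves on.
            transcript.append(("peer", scenario.sanction_text))
    return transcript, violations


def toy_agent(transcript: List[Message]) -> str:
    """Stand-in for an LLM: shouts until a peer complains about caps."""
    sanctioned = any(who == "peer" and "caps" in text
                     for who, text in transcript)
    return "sounds good to me" if sanctioned else "SOUNDS GOOD TO ME"


# Toy hidden norm: nobody in this chat writes in all capitals.
scenario = Scenario(
    violates=lambda text: text.isupper(),
    peer_script=["lunch at noon?", "who is bringing snacks?", "see you there"],
    sanction_text="hey, no need for caps here",
)
transcript, violations = run_scenario(scenario, toy_agent)
print(violations)
```

In this toy run the subject breaches the norm once, is sanctioned, and complies afterwards; a real evaluation would replace `toy_agent` with a model call and the one-line detector with whatever violation judgment the benchmark uses.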

What carries the argument

SCENE benchmark, which constructs dynamic group-chat interactions around an implicit norm, violation opportunities, and scripted peer sanctions to measure adaptation.

Load-bearing premise

The generated scenarios with scripted personas accurately represent real implicit norms and sanctioning behaviors in human group chats.

What would settle it

Measuring the same models' adaptation rates in live human group chats that contain comparable norm violations would settle it: the claim is falsified if the performance ordering reverses or disappears.

Figures

Figures reproduced from arXiv: 2605.07823 by Maksymilian Bilski, Mateusz Jacniacki.

**Figure 2.** Per-model bias across six pairs of mutually exclusive chat conventions, e.g. titles vs. first ...

Original abstract

Online group chats are social spaces with implicit behavior patterns that, when broken, are often met with social sanctioning from the group. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce SCENE, a social-interaction benchmark focused on implicit norms and social sanctioning in multi-party chat. SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. We further propose behavioral evaluation metrics for two functional adaptation abilities: responsiveness to negative sanctioning, and adapting the norm from peers' behavior. We evaluate six frontier and open-weight models on SCENE. Our results show that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models. SCENE contributes one benchmark in the direction of recent calls for dynamic, interactional evaluation of LLM social capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SCENE is a new benchmark for implicit norm adaptation and sanction response in group chats, but its synthetic scenarios leave the model comparisons on shaky ground.

SCENE is a benchmark for how well LLMs recognize and adapt to implicit social norms in group chats, including how they respond to sanctions when they break those norms. The paper generates plausible scenarios with scripted personas that follow a hidden norm, creates opportunities for the subject agent to violate it, and then has the group apply sanctions. Two metrics are defined: responsiveness to negative sanctioning and adapting the norm from peers' behavior. The authors test six models and find that Claude Opus 4.7 and Gemini 3.1 Pro adapt more than the open-weight models.

This moves evaluation toward dynamic, multi-party interactions rather than fixed datasets. The focus on implicit norms and sanctioning is a reasonable extension of prior work on social capabilities.

The soft spot is the synthetic nature of the data. The scenarios are generated, not drawn from real chats, so it is unclear whether the norms and sanction patterns match what happens in actual human group conversations. The abstract provides no information on validation steps or on the statistical significance of the results, which makes the performance differences hard to interpret confidently.

This paper is for people interested in benchmarks for LLM agents in social settings. A reader looking for new evaluation ideas in conversational AI would find it worth considering. It deserves peer review, and the benchmark concept would benefit from referee input on how to strengthen the connection to real-world behavior.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SCENE, a benchmark for evaluating LLMs' recognition of and adaptation to implicit social norms and sanctioning in multi-party online group chats. It generates synthetic non-roleplay scenarios using scripted personas that follow a hidden norm, create violation opportunities for the evaluated agent, and apply sanctions upon breaches. Two behavioral metrics are proposed: responsiveness to negative sanctioning and norm adaptation from peer behavior. Evaluation of six frontier and open-weight models shows that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the open-weight models tested. The work positions SCENE as a step toward dynamic, interactional evaluation of LLM social capabilities.

Significance. If the synthetic scenarios prove to capture the subtlety, context-dependence, and sanction intensity of authentic human group chats, SCENE would offer a useful tool for quantifying differences in social norm adaptation between model classes. This directly addresses recent calls for moving beyond static benchmarks toward interaction-based assessments of LLM social intelligence. The contribution is strengthened by its focus on multi-party sanctioning dynamics rather than single-turn norm classification.

major comments (3)

[Abstract / SCENE benchmark construction] The central claim that Claude Opus 4.7 and Gemini 3.1 Pro adapt significantly more than open-weight models rests on the unvalidated assumption that scenarios generated with scripted personas accurately proxy real implicit norms and sanctioning in human group chats (see abstract description of benchmark construction). No details are supplied on scenario validation, such as human annotation of norm subtlety, comparison to real chat corpora, or controls for context-dependence, so observed model differences risk being artifacts of the synthetic setup rather than evidence of superior norm recognition.
[Abstract / Evaluation metrics] The proposed metrics (responsiveness to negative sanctioning; adapting norm from peer behavior) are defined relative to the scripted interactions, yet the abstract supplies no information on their exact computation, aggregation across scenarios, variance, or statistical significance testing. Without these, the qualitative claim of 'significantly more' adaptation cannot be assessed and remains unsupported by visible evidence.
[Abstract / Model evaluation] The evaluation reports results on six models but provides no information on the number of scenarios, scenario diversity, prompt sensitivity controls, or inter-scenario consistency. These omissions are load-bearing because the performance gap between closed and open models could be driven by a small or non-representative set of synthetic cases.

minor comments (2)

[Abstract] The abstract refers to 'six frontier and open-weight models' without naming them or their versions; listing the specific models evaluated would improve immediate clarity.
[Evaluation section] Consider adding a results table with per-model metric scores, standard deviations, and example scenario traces to make the quantitative claims more transparent.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying our design choices while committing to revisions that improve transparency and address the noted gaps. We believe the controlled synthetic construction of SCENE enables reproducible measurement of adaptation behaviors that would be difficult to isolate in real chat data.

Referee: [Abstract / SCENE benchmark construction] The central claim that Claude Opus 4.7 and Gemini 3.1 Pro adapt significantly more than open-weight models rests on the unvalidated assumption that scenarios generated with scripted personas accurately proxy real implicit norms and sanctioning in human group chats (see abstract description of benchmark construction). No details are supplied on scenario validation, such as human annotation of norm subtlety, comparison to real chat corpora, or controls for context-dependence, so observed model differences risk being artifacts of the synthetic setup rather than evidence of superior norm recognition.

Authors: We agree that the abstract does not detail validation procedures, which leaves the generalizability of results open to question. SCENE is deliberately constructed as a synthetic benchmark using scripted personas to create fully observable, controllable hidden norms and sanction sequences; this design choice prioritizes internal validity and reproducibility over immediate ecological validity. The full manuscript (Section 3) explains the generation pipeline and the rationale for avoiding real chat data due to privacy, annotation cost, and confounding factors. We will revise the manuscript by adding an explicit limitations subsection that acknowledges the synthetic nature, includes a plan for future human validation (e.g., expert annotation of norm subtlety on a held-out scenario sample and qualitative alignment checks against public chat logs), and clarifies that current claims are scoped to the controlled setting. This will make the positioning of SCENE as an initial step toward interactional evaluation more precise. revision: yes
Referee: [Abstract / Evaluation metrics] The proposed metrics (responsiveness to negative sanctioning; adapting norm from peer behavior) are defined relative to the scripted interactions, yet the abstract supplies no information on their exact computation, aggregation across scenarios, variance, or statistical significance testing. Without these, the qualitative claim of 'significantly more' adaptation cannot be assessed and remains unsupported by visible evidence.

Authors: The metrics are formally defined and operationalized in Section 4 of the manuscript, with responsiveness quantified as the post-sanction compliance rate and norm adaptation as the change in violation frequency after peer exposure. Aggregation uses mean performance across scenarios with accompanying variance. We accept that the abstract omits these computational details and the supporting statistical tests. In the revision we will (1) add a concise description of metric computation to the abstract, (2) report standard deviations and inter-scenario variance in the results, and (3) include appropriate statistical comparisons (e.g., t-tests or non-parametric equivalents with p-values) between model classes to substantiate the 'significantly more' statement. These changes will render the evidence traceable from the abstract onward. revision: yes
Referee: [Abstract / Model evaluation] The evaluation reports results on six models but provides no information on the number of scenarios, scenario diversity, prompt sensitivity controls, or inter-scenario consistency. These omissions are load-bearing because the performance gap between closed and open models could be driven by a small or non-representative set of synthetic cases.

Authors: The experimental setup, including the total number of scenarios, coverage of distinct implicit norms and topics, and controls for prompt variation, is described in the methods and evaluation sections of the full manuscript. We acknowledge that these parameters are not summarized in the abstract, which can make the scale and robustness of the evaluation difficult to assess at a glance. We will revise the abstract to include a brief statement of the evaluation scale and diversity, and we will add a short paragraph in the results section that reports prompt-sensitivity checks and per-norm consistency metrics. These additions will directly address concerns that the observed closed- versus open-model gap might be driven by a narrow or unrepresentative scenario set. revision: yes
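To make the second response concrete, here is a minimal Python sketch of how the two metrics and a between-group comparison might be computed. Everything in it is an assumption for illustration: the function names are hypothetical, the sign convention for norm adaptation (violation rate before peer exposure minus the rate after, so positive means fewer violations) is a choice made here, and the permutation test is just one non-parametric option; the rebuttal does not say which test the authors will use. The numbers in the usage lines are placeholders, not results from the paper.

```python
import random
from statistics import mean, stdev
from typing import List, Optional, Sequence, Tuple


def responsiveness(violations: Sequence[bool]) -> Optional[float]:
    """Post-sanction compliance rate: share of opportunities after the
    first (sanctioned) violation on which the agent complied."""
    flags = list(violations)
    if True not in flags:
        return None  # never violated, so never sanctioned
    later = flags[flags.index(True) + 1:]
    if not later:
        return None  # no opportunity left after the sanction
    return 1.0 - sum(later) / len(later)


def norm_adaptation(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Drop in violation frequency after peer exposure (assumed sign:
    positive means the agent violates less often afterwards)."""
    return sum(before) / len(before) - sum(after) / len(after)


def aggregate(scores: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Mean and sample standard deviation across scenarios, skipping
    scenarios where the metric is undefined (None)."""
    valid = [s for s in scores if s is not None]
    spread = stdev(valid) if len(valid) > 1 else 0.0
    return mean(valid), spread


def permutation_test(a: Sequence[float], b: Sequence[float],
                     n_resamples: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled: List[float] = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_resamples + 1)


# Placeholder per-scenario scores for two hypothetical model groups.
group_a = [0.9, 0.8, 0.85, 0.95, 0.9]
group_b = [0.3, 0.4, 0.35, 0.2, 0.3]
print(aggregate(group_a), aggregate(group_b))
print(permutation_test(group_a, group_b))
```

Returning `None` for scenarios in which no sanction occurred keeps those runs from being counted as perfect or zero responsiveness; how the paper treats such cases is not stated in the material reviewed here.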

Circularity Check

0 steps flagged

No circularity in SCENE benchmark evaluation

The paper introduces SCENE as a new benchmark that generates synthetic multi-party chat scenarios using scripted personas to test LLM adaptation to implicit norms and sanctioning. It defines behavioral metrics (responsiveness to negative sanctioning and adapting from peer behavior) directly from these generated interactions and reports empirical results comparing frontier and open-weight models. No equations, parameter fitting, derivations, or self-citations are present that reduce any claimed result to its own inputs by construction. The evaluation chain is self-contained as an independent benchmark assessment rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work introduces a new benchmark without mathematical derivations, fitted parameters, or postulated entities. It relies on the domain assumption that scripted chat scenarios can proxy real social norms.

axioms (1)

Domain assumption: Scripted personas following hidden norms produce plausible sanctioning behavior representative of human groups. Invoked in the benchmark generation process described in the abstract.

pith-pipeline@v0.9.0 · 5464 in / 1093 out tokens · 27375 ms · 2026-05-11T02:16:03.317346+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

