SCENE: Recognizing Social Norms and Sanctioning in Group Chats
Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3
The pith
SCENE benchmark shows Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit group chat norms significantly better than open-weight models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. It defines behavioral metrics for two abilities: responsiveness to negative sanctioning, and adapting to the norm by observing peers' behavior. On this benchmark, Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models.
What carries the argument
SCENE benchmark, which constructs dynamic group-chat interactions around an implicit norm, violation opportunities, and scripted peer sanctions to measure adaptation.
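The interaction loop described here (a hidden norm, scripted violation opportunities, and a peer sanction after each breach) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Turn` and `Scenario` types, the `run_scenario` function, the all-caps toy norm, and the `toy_subject` stand-in for a model call are hypothetical and are not taken from the paper's pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    speaker: str            # "peer" or "subject"
    text: str
    is_violation: bool = False
    is_sanction: bool = False


@dataclass
class Scenario:
    violates: Callable[[str], bool]  # hypothetical detector for the hidden norm
    opportunities: List[str]         # scripted peer messages that invite a reply
    sanction_text: str               # scripted peer reaction to a breach


def run_scenario(scenario: Scenario,
                 subject_reply: Callable[[List[Turn]], str]) -> List[Turn]:
    """Play the scripted turns and sanction the subject after each breach."""
    transcript: List[Turn] = []
    for prompt in scenario.opportunities:
        transcript.append(Turn("peer", prompt))
        reply = subject_reply(transcript)
        violated = scenario.violates(reply)
        transcript.append(Turn("subject", reply, is_violation=violated))
        if violated:
            transcript.append(Turn("peer", scenario.sanction_text, is_sanction=True))
    return transcript


# Toy hidden norm (assumed for this sketch): nobody in this chat writes in all caps.
lowercase_chat = Scenario(
    violates=lambda text: text.isupper(),
    opportunities=["anyone up for lunch?", "thai or pizza?", "12:30 ok?"],
    sanction_text="hey, easy on the caps please",
)


def toy_subject(transcript: List[Turn]) -> str:
    """Stand-in for an LLM call: shouts until a peer mentions caps."""
    warned = any(t.speaker == "peer" and "caps" in t.text for t in transcript)
    return "sounds good" if warned else "SOUNDS GOOD"


transcript = run_scenario(lowercase_chat, toy_subject)
violations = [t.is_violation for t in transcript if t.speaker == "subject"]
```

In this toy run the subject breaches the norm once, is sanctioned, and complies afterwards, so `violations` records one breach followed by two compliant turns. A real harness would replace `toy_subject` with a model call and `violates` with whatever norm check the benchmark actually uses.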
Load-bearing premise
The generated scenarios with scripted personas accurately represent real implicit norms and sanctioning behaviors in human group chats.
What would settle it
Placing the same models in live human group chats that contain comparable norm violations and comparing their adaptation rates. The claim would be falsified if the performance ordering reversed or disappeared.
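One way to make "the performance ordering reverses or disappears" concrete is a rank correlation between per-model scores in the synthetic setting and in the live setting. The sketch below uses Kendall's tau (the tau-a variant, which ignores ties); this choice of statistic and the function name `kendall_tau` are assumptions of this illustration, not something the paper or the review specifies.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a between two equally long score lists (same model order)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1      # the pair is ranked the same way in both settings
        elif sign < 0:
            discordant += 1      # the pair swaps order between settings
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value near +1 would mean the live chats preserve the benchmark ordering, a value near -1 would mean it reverses, and a value near 0 would mean the ordering largely disappears. With only six models there are 15 pairs, so any such value would carry wide uncertainty.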
Original abstract
Online group chats are social spaces with implicit behavior patterns that, when broken, are often met with social sanctioning from the group. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce SCENE, a social-interaction benchmark focused on implicit norms and social sanctioning in multi-party chat. SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. We further propose behavioral evaluation metrics for two functional adaptation abilities: responsiveness to negative sanctioning, and adapting norm from peers behavior. We evaluate six frontier and open-weight models on SCENE. Our results show that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models. SCENE contributes one benchmark in the direction of recent calls for dynamic, interactional evaluation of LLM social capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SCENE, a benchmark for evaluating LLMs' recognition of and adaptation to implicit social norms and sanctioning in multi-party online group chats. It generates synthetic non-roleplay scenarios using scripted personas that follow a hidden norm, create violation opportunities for the evaluated agent, and apply sanctions upon breaches. Two behavioral metrics are proposed: responsiveness to negative sanctioning and norm adaptation from peer behavior. Evaluation of six frontier and open-weight models shows that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the open-weight models tested. The work positions SCENE as a step toward dynamic, interactional evaluation of LLM social capabilities.
Significance. If the synthetic scenarios prove to capture the subtlety, context-dependence, and sanction intensity of authentic human group chats, SCENE would offer a useful tool for quantifying differences in social norm adaptation between model classes. This directly addresses recent calls for moving beyond static benchmarks toward interaction-based assessments of LLM social intelligence. The contribution is strengthened by its focus on multi-party sanctioning dynamics rather than single-turn norm classification.
major comments (3)
- [Abstract / SCENE benchmark construction] The central claim that Claude Opus 4.7 and Gemini 3.1 Pro adapt significantly more than open-weight models rests on the unvalidated assumption that scenarios generated with scripted personas accurately proxy real implicit norms and sanctioning in human group chats (see abstract description of benchmark construction). No details are supplied on scenario validation, such as human annotation of norm subtlety, comparison to real chat corpora, or controls for context-dependence, so observed model differences risk being artifacts of the synthetic setup rather than evidence of superior norm recognition.
- [Abstract / Evaluation metrics] The proposed metrics (responsiveness to negative sanctioning; adapting norm from peer behavior) are defined relative to the scripted interactions, yet the abstract supplies no information on their exact computation, aggregation across scenarios, variance, or statistical significance testing. Without these, the qualitative claim of 'significantly more' adaptation cannot be assessed and remains unsupported by visible evidence.
- [Abstract / Model evaluation] The evaluation reports results on six models but provides no information on the number of scenarios, scenario diversity, prompt sensitivity controls, or inter-scenario consistency. These omissions are load-bearing because the performance gap between closed and open models could be driven by a small or non-representative set of synthetic cases.
minor comments (2)
- [Abstract] The abstract refers to 'six frontier and open-weight models' without naming them or their versions; listing the specific models evaluated would improve immediate clarity.
- [Evaluation section] Consider adding a results table with per-model metric scores, standard deviations, and example scenario traces to make the quantitative claims more transparent.
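The results table suggested in the last minor comment could be assembled along the following lines. This is a sketch under stated assumptions: the function names `summarize` and `render_table` are hypothetical, per-scenario scores are assumed to be plain floats, and no values from the paper are used or implied.

```python
from statistics import mean, stdev


def summarize(scores_by_model):
    """Return (model, n, mean, sd) rows, best mean first."""
    rows = []
    for model, scores in scores_by_model.items():
        sd = stdev(scores) if len(scores) > 1 else 0.0  # sample standard deviation
        rows.append((model, len(scores), mean(scores), sd))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def render_table(rows):
    """Format the rows as a fixed-width plain-text table."""
    lines = [f"{'model':<12}{'n':>4}{'mean':>8}{'sd':>8}"]
    for model, n, avg, sd in rows:
        lines.append(f"{model:<12}{n:>4}{avg:>8.3f}{sd:>8.3f}")
    return "\n".join(lines)
```

Calling `render_table(summarize(scores))` on a dictionary that maps each model name to its per-scenario metric scores yields one row per model. Reporting `n` next to the mean and standard deviation also speaks to the referee's question about how many scenarios stand behind each number.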
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying our design choices while committing to revisions that improve transparency and address the noted gaps. We believe the controlled synthetic construction of SCENE enables reproducible measurement of adaptation behaviors that would be difficult to isolate in real chat data.
Point-by-point responses
- Referee: [Abstract / SCENE benchmark construction] The central claim that Claude Opus 4.7 and Gemini 3.1 Pro adapt significantly more than open-weight models rests on the unvalidated assumption that scenarios generated with scripted personas accurately proxy real implicit norms and sanctioning in human group chats (see abstract description of benchmark construction). No details are supplied on scenario validation, such as human annotation of norm subtlety, comparison to real chat corpora, or controls for context-dependence, so observed model differences risk being artifacts of the synthetic setup rather than evidence of superior norm recognition.
  Authors: We agree that the abstract does not detail validation procedures, which leaves the generalizability of results open to question. SCENE is deliberately constructed as a synthetic benchmark using scripted personas to create fully observable, controllable hidden norms and sanction sequences; this design choice prioritizes internal validity and reproducibility over immediate ecological validity. The full manuscript (Section 3) explains the generation pipeline and the rationale for avoiding real chat data due to privacy, annotation cost, and confounding factors. We will revise the manuscript by adding an explicit limitations subsection that acknowledges the synthetic nature, includes a plan for future human validation (e.g., expert annotation of norm subtlety on a held-out scenario sample and qualitative alignment checks against public chat logs), and clarifies that current claims are scoped to the controlled setting. This will make the positioning of SCENE as an initial step toward interactional evaluation more precise.
  revision: yes
- Referee: [Abstract / Evaluation metrics] The proposed metrics (responsiveness to negative sanctioning; adapting norm from peer behavior) are defined relative to the scripted interactions, yet the abstract supplies no information on their exact computation, aggregation across scenarios, variance, or statistical significance testing. Without these, the qualitative claim of 'significantly more' adaptation cannot be assessed and remains unsupported by visible evidence.
  Authors: The metrics are formally defined and operationalized in Section 4 of the manuscript, with responsiveness quantified as the post-sanction compliance rate and norm adaptation as the change in violation frequency after peer exposure. Aggregation uses mean performance across scenarios with accompanying variance. We accept that the abstract omits these computational details and the supporting statistical tests. In the revision we will (1) add a concise description of metric computation to the abstract, (2) report standard deviations and inter-scenario variance in the results, and (3) include appropriate statistical comparisons (e.g., t-tests or non-parametric equivalents with p-values) between model classes to substantiate the 'significantly more' statement. These changes will render the evidence traceable from the abstract onward.
  revision: yes
- Referee: [Abstract / Model evaluation] The evaluation reports results on six models but provides no information on the number of scenarios, scenario diversity, prompt sensitivity controls, or inter-scenario consistency. These omissions are load-bearing because the performance gap between closed and open models could be driven by a small or non-representative set of synthetic cases.
  Authors: The experimental setup, including the total number of scenarios, coverage of distinct implicit norms and topics, and controls for prompt variation, is described in the methods and evaluation sections of the full manuscript. We acknowledge that these parameters are not summarized in the abstract, which can make the scale and robustness of the evaluation difficult to assess at a glance. We will revise the abstract to include a brief statement of the evaluation scale and diversity, and we will add a short paragraph in the results section that reports prompt-sensitivity checks and per-norm consistency metrics. These additions will directly address concerns that the observed closed- versus open-model gap might be driven by a narrow or unrepresentative scenario set.
  revision: yes
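The metric descriptions in the authors' second response (post-sanction compliance rate, change in violation frequency after peer exposure, and a non-parametric comparison between model classes) can be sketched as follows. This is a reading of the rebuttal's one-sentence definitions, not the paper's Section 4: the function names, the choice to measure compliance at the opportunity immediately after a breach, and the use of a two-sided permutation test are all assumptions of this illustration.

```python
import random
from statistics import mean


def post_sanction_compliance(violations):
    """Share of opportunities right after a breach where the subject complied.

    `violations[i]` is True if the subject broke the norm at opportunity i;
    each breach is assumed to be followed by a scripted sanction.
    """
    followups = [not nxt for cur, nxt in zip(violations, violations[1:]) if cur]
    return sum(followups) / len(followups) if followups else None


def adaptation_gain(before_exposure, after_exposure):
    """Drop in violation frequency after the subject has seen peers follow the norm."""
    def rate(flags):
        return sum(flags) / len(flags)
    return rate(before_exposure) - rate(after_exposure)


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_resamples + 1)
```

Here `a` and `b` would hold per-scenario scores for two model classes. The permutation test stands in for the "non-parametric equivalents" the authors mention; it makes no normality assumption, which matters when scores are rates bounded between 0 and 1.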
Circularity Check
No circularity in SCENE benchmark evaluation
The paper introduces SCENE as a new benchmark that generates synthetic multi-party chat scenarios using scripted personas to test LLM adaptation to implicit norms and sanctioning. It defines behavioral metrics (responsiveness to negative sanctioning and adapting from peer behavior) directly from these generated interactions and reports empirical results comparing frontier and open-weight models. No equations, parameter fitting, derivations, or self-citations are present that reduce any claimed result to its own inputs by construction. The evaluation chain is self-contained as an independent benchmark assessment rather than a closed logical loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Scripted personas following hidden norms produce plausible sanctioning behavior representative of human groups.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.