PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies
Pith reviewed 2026-05-20 02:42 UTC · model grok-4.3
The pith
The PAVE architecture lets generative agents break a rule only when a trigger justifies it, while still deferring to authority and returning to baseline behavior once the trigger ends.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAVE agents satisfy four properties simultaneously: legitimate violation only when a trigger justifies it, authority deference where officer instructions override even high legitimacy, bounded scope where violations stay confined to the targeted rule, and recovery where baseline behavior returns once the trigger ends. This holds across three scenarios, four LLM backbones, and focused ablations in the Voville environment.
What carries the argument
The four-module PAVE architecture enforces controlled legitimate violation: Perception builds a structured context with authority and severity cues, Assessment scores that context on five scalars including a legitimacy judgment, Verdict gates the choice between compliance and violation, and Emulation executes the action within the scope of the justified rule.
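A minimal sketch of how the Verdict gate could sit between Assessment and Emulation. All field names, the 0 to 1 scale, and the order of checks are assumptions made for illustration; the source names only the legitimacy scalar among the five, so the other four are left in an opaque mapping. This is not the released PAVE code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Context:
    """Hypothetical Perception output for one simulation tick."""
    rule: str                                 # rule the agent is currently bound by
    trigger: Optional[str] = None             # e.g. "fire"; None when nothing unusual happens
    officer_orders_compliance: bool = False   # True when an officer tells the agent to comply


@dataclass
class Assessment:
    """Hypothetical Assessment output; only legitimacy is named in the source."""
    legitimacy: float                         # assumed range: 0.0 to 1.0
    other_scalars: Dict[str, float] = field(default_factory=dict)


@dataclass
class Verdict:
    action: str                               # "comply" or "violate"
    scoped_rule: Optional[str] = None         # the single rule a violation applies to


def decide(ctx: Context, scores: Assessment, threshold: float) -> Verdict:
    """Hard legitimacy gate with a per-agent threshold (assumed semantics)."""
    # Authority deference: an officer instruction overrides any legitimacy score.
    if ctx.officer_orders_compliance:
        return Verdict("comply")
    # Legitimate violation: no trigger means no violation, whatever the score.
    if ctx.trigger is None:
        return Verdict("comply")
    # Hard gate: violate only at or above the persona-elicited threshold.
    if scores.legitimacy >= threshold:
        # Bounded scope: the violation is tied to the one targeted rule.
        return Verdict("violate", scoped_rule=ctx.rule)
    return Verdict("comply")
```

Recovery falls out of this shape without extra state: once Perception stops reporting a trigger, `decide` returns to `comply` on the next tick.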
If this is right
- Agents handle emergency scenarios such as fire evacuation or authority-directed actions without blanket rule violations.
- Decisions become more structured and interpretable than those of vanilla LLM agents across all four properties.
- Human evaluators rate PAVE agent behavior as more plausible in the tested environments.
- Removing the legitimacy gate causes the system to reproduce the rule-breaking failures seen in standard agents.
Where Pith is reading between the lines
- The design could transfer to autonomous vehicles or service robots that must occasionally ignore standard constraints during rare events.
- Persona-specific thresholds open a route for personalized rule flexibility without retraining the underlying model.
- Explicit scalar outputs create an audit trail useful for safety monitoring in deployed multi-agent systems.
- Scaling the approach to longer-horizon or multi-rule conflicts would test whether the Assessment module remains stable.
Load-bearing premise
Large language models can reliably compute the five assessment scalars, especially the legitimacy judgment checking necessity, proportionality, and absence of alternatives, from the structured context produced by the Perception module.
What would settle it
In controlled Voville runs, check whether PAVE agents ever violate a rule without a justifying trigger, or ignore an explicit officer instruction when their legitimacy score is high. Either behavior would contradict the core claim.
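The check described here can be sketched as a scan over per-tick decision records. The record layout below is hypothetical (the source does not describe the trace format of the released pipeline); the sketch only shows what counting counterexamples might look like.

```python
from typing import List, NamedTuple, Tuple


class Tick(NamedTuple):
    """Hypothetical per-tick decision record."""
    t: int
    trigger_active: bool
    officer_orders_compliance: bool
    action: str  # "comply" or "violate"


def counterexamples(trace: List[Tick]) -> List[Tuple[int, str]]:
    """Return (tick, reason) pairs for decisions that contradict the core claim."""
    found = []
    for tick in trace:
        if tick.action != "violate":
            continue  # compliance never contradicts the claim
        if not tick.trigger_active:
            found.append((tick.t, "violation without trigger"))
        elif tick.officer_orders_compliance:
            found.append((tick.t, "officer instruction ignored"))
    return found
```

An empty result over many runs would be consistent with the claim; a single entry would be enough to question it.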
Original abstract
Generative agents based on large language models reproduce believable human behavior in cooperative settings, but how they should reason in situations where rule-breaking may be required, such as fire evacuation or authority-supervised emergency, remains poorly characterized. We propose PAVE (Perception, Assessment, Verdict, Emulation), a novel four-module cognitive architecture that addresses this gap end to end: (i) Perception extracts a structured context with explicit authority distance, peer behaviors, and severity-tagged situational cues; (ii) Assessment scores the context along five scalars including an explicit legitimacy judgment that checks necessity, proportionality, and absence of alternatives; (iii) Verdict decides to comply or violate under a hard legitimacy gate, with a per-agent threshold elicited from the persona; (iv) Emulation enacts the verdict and scopes the violation to the rule the trigger justifies. We instantiate PAVE in Voville, a tile-based traffic environment forked from Smallville, and evaluate across three scenarios, four LLM backbones, and a focused ablation. PAVE agents satisfy four properties simultaneously: legitimate violation (only when a trigger justifies it), authority deference (officer instructions override even high legitimacy), bounded scope (violations confined to the targeted rule), and recovery (baseline restored once the trigger ends). PAVE agents make more structured and interpretable decisions than vanilla across all four properties, and human evaluators rate them as more plausible. Ablating the legitimacy gate reproduces vanilla-like failures. We release Voville, the PAVE prompts and code, and the evaluation pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PAVE, a four-module cognitive architecture (Perception, Assessment, Verdict, Emulation) for generative agents to handle legitimate rule violations in scenarios such as fire evacuations or authority-supervised emergencies. Perception extracts structured context including authority distance and severity cues; Assessment scores five scalars with an explicit legitimacy judgment (necessity, proportionality, absence of alternatives); Verdict applies a hard legitimacy gate using a persona-elicited per-agent threshold; Emulation scopes the action to the justified rule. The architecture is instantiated in the Voville traffic simulation (forked from Smallville) and evaluated across three scenarios, four LLM backbones, and a focused ablation. The central claim is that PAVE agents simultaneously satisfy legitimate violation, authority deference, bounded scope, and recovery, produce more structured decisions than vanilla agents, and receive higher plausibility ratings from human evaluators, with ablation of the legitimacy gate reproducing vanilla failures.
Significance. If the empirical claims hold, the work fills a gap in generative agent reasoning for non-cooperative or emergency settings by offering a modular, interpretable mechanism that enforces controlled flexibility while preserving deference and recovery. The open release of Voville, prompts, code, and evaluation pipeline supports reproducibility and extension in multi-agent systems research.
major comments (2)
- [Assessment module and Evaluation] The central claims rest on the Assessment module reliably producing accurate scores for its five scalars (especially the legitimacy judgment that checks necessity, proportionality, and absence of alternatives) from Perception output. The manuscript provides no quantitative validation of LLM reliability for these scalars, such as human agreement rates, error analysis on edge cases, or inter-rater metrics, leaving the Verdict gate's enforcement of the four properties unverified (see Assessment module description and Evaluation section).
- [Evaluation] The abstract and evaluation claim that PAVE agents satisfy the four properties simultaneously and that ablating the legitimacy gate reproduces vanilla failures, yet no quantitative metrics, error bars, raw decision traces, or per-scenario success rates are reported. This absence makes it impossible to assess effect sizes or consistency across the three scenarios and four LLM backbones.
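The per-scenario reporting requested in the comments above could take roughly the following form: a success rate for each (scenario, backbone) pair with a Wilson score interval as the error bar. The record layout and function names are assumptions, and the choice of the Wilson interval is ours, not the paper's.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def success_table(
    runs: Iterable[Tuple[str, str, bool]]
) -> Dict[Tuple[str, str], Tuple[float, float, float]]:
    """Map (scenario, backbone) to (rate, lower bound, upper bound)."""
    counts = defaultdict(lambda: [0, 0])  # [successes, total]
    for scenario, backbone, ok in runs:
        counts[(scenario, backbone)][0] += int(ok)
        counts[(scenario, backbone)][1] += 1
    return {key: (s / n, *wilson_interval(s, n)) for key, (s, n) in counts.items()}
```

The Wilson interval is used here because it stays inside [0, 1] and behaves sensibly when a property holds in every run, which is the outcome the abstract implies.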
minor comments (2)
- [Verdict module] The elicitation process for the per-agent legitimacy threshold from the persona description could be specified with an example prompt or procedure to improve reproducibility.
- [Assessment module] Clarify whether the five assessment scalars are normalized or have explicit ranges, and consider adding a summary table of scalar definitions and their roles in the Verdict gate.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with clear indications of planned revisions to strengthen the empirical grounding of the claims.
- Referee: [Assessment module and Evaluation] The central claims rest on the Assessment module reliably producing accurate scores for its five scalars (especially the legitimacy judgment that checks necessity, proportionality, and absence of alternatives) from Perception output. The manuscript provides no quantitative validation of LLM reliability for these scalars, such as human agreement rates, error analysis on edge cases, or inter-rater metrics, leaving the Verdict gate's enforcement of the four properties unverified (see Assessment module description and Evaluation section).
Authors: We agree that the current manuscript lacks direct quantitative validation of the Assessment module, such as human agreement rates or inter-rater metrics on the five scalars and the legitimacy judgment. The evaluation instead centers on end-to-end simulation outcomes, human plausibility ratings, and the ablation of the legitimacy gate. To address this gap, we will add a new analysis subsection to the Evaluation section. This will include consistency checks across repeated LLM queries for scalar outputs on edge cases and a small-scale human annotation study measuring agreement on a sample of Perception-to-Assessment mappings. These additions will provide explicit verification for the Verdict gate's role in enforcing the four properties. revision: yes
- Referee: [Evaluation] The abstract and evaluation claim that PAVE agents satisfy the four properties simultaneously and that ablating the legitimacy gate reproduces vanilla failures, yet no quantitative metrics, error bars, raw decision traces, or per-scenario success rates are reported. This absence makes it impossible to assess effect sizes or consistency across the three scenarios and four LLM backbones.
Authors: The manuscript states that PAVE agents satisfy the four properties simultaneously across the three scenarios and four LLM backbones, with the ablation reproducing vanilla-like failures and higher human plausibility ratings. We acknowledge that explicit quantitative metrics, error bars, per-scenario success rates, and raw decision traces are not tabulated in the current version, which limits assessment of consistency and effect sizes. We will revise the Evaluation section to include these: success rates for each property broken down by scenario and LLM backbone, with error bars where applicable, and selected raw decision traces placed in an appendix. This will make the empirical support more transparent and verifiable. revision: yes
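The human annotation study promised in the first response would need an agreement statistic. One common choice is Cohen's kappa between a human annotator and the model on categorical legitimacy labels; the sketch below assumes that setup, and neither the label set nor the choice of kappa comes from the manuscript.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)
```

The same function can compare two repeated model queries on the same Perception output, which covers the consistency check the authors also mention.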
Circularity Check
No circularity: PAVE is an empirical engineering architecture, not a closed derivation
The paper proposes PAVE as a four-module cognitive architecture (Perception, Assessment, Verdict, Emulation) instantiated in simulation and evaluated across scenarios, LLMs, and ablations. The four claimed properties (legitimate violation, authority deference, bounded scope, recovery) are demonstrated through agent behavior in Voville rather than derived from equations or self-referential definitions. No load-bearing step reduces a prediction to a fitted input or self-citation chain by construction; the central claims rest on LLM prompt engineering and empirical results, which are externally falsifiable via the released code and evaluation pipeline.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-agent legitimacy threshold
axioms (1)
- domain assumption: Large language models can produce reliable structured outputs for legitimacy assessment when given explicit context scalars.
invented entities (1)
- Legitimacy judgment scalar: no independent evidence