Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
LLM agents show limited alignment with human emotional responses to bureaucratic red tape, performing worse in Eastern cultures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the same single red-tape scenario and emotion-rating instruments used in earlier human studies are given to current LLMs, all models produce responses that align only modestly with the human benchmarks, with noticeably lower agreement for Eastern cultural contexts; standard cultural-prompting techniques do not meaningfully close that gap.
What carries the argument
An evaluation framework that feeds identical red-tape vignettes and Likert-style emotion scales to LLMs under varied cultural prompts and then compares the output distributions against published human-subject baselines.
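The comparison step of such a framework can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the function name `mean_abs_gap`, the choice of mean absolute difference as the comparison statistic, the emotion labels, and all rating values are assumptions made here for illustration.

```python
from statistics import mean

# Hypothetical placeholder ratings on a 1-5 Likert scale.
# These numbers are illustrative only and are NOT taken from the paper.
human_means = {"anger": 3.8, "frustration": 4.1, "confusion": 3.2}
llm_means = {"anger": 3.1, "frustration": 4.4, "confusion": 2.5}


def mean_abs_gap(human, llm):
    """Average absolute difference over the emotion items both sources rated."""
    shared = sorted(set(human) & set(llm))
    if not shared:
        raise ValueError("no shared emotion items to compare")
    return mean(abs(human[item] - llm[item]) for item in shared)


if __name__ == "__main__":
    # A smaller gap means the model's mean ratings sit closer to the human baseline.
    print(f"mean absolute gap: {mean_abs_gap(human_means, llm_means):.3f}")
```

Running the same function once per cultural condition would give one gap per culture, which is one simple way to make an East-West difference in alignment visible.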
If this is right
- Public-administration experiments that rely on LLM agents for cross-cultural emotion data will need independent validation before policy conclusions are drawn.
- Cultural prompting alone is unlikely to be a sufficient fix for alignment gaps in this domain.
- Collecting fresh human ratings through the released RAMO interface becomes necessary to retrain or calibrate future models.
Where Pith is reading between the lines
- If the alignment gap persists across additional scenarios, researchers may need entirely new training objectives rather than prompt engineering for culturally sensitive social simulation.
- The same evaluation framework could be reused to test whether newer models close the Eastern-culture gap without changes to the underlying scenario.
- Policymakers considering LLM-based citizen feedback tools should treat Eastern-culture results as especially provisional.
Load-bearing premise
A single red-tape scenario plus the chosen emotion metrics can stand in for the broader range of cross-cultural differences citizens feel toward bureaucratic procedures.
What would settle it
A new human-subject study using the identical scenario, scales, and cultural groups that finds LLM outputs statistically indistinguishable from the fresh human ratings.
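One way to make "statistically indistinguishable" operational is an equivalence-style check: estimate a confidence interval for the difference in mean ratings and ask whether it lies entirely inside a pre-chosen margin. The sketch below uses a percentile bootstrap; the function names, the margin, and the procedure itself are assumptions for illustration, since the page does not specify a test.

```python
import random
from statistics import mean


def bootstrap_diff_ci(human, llm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(llm) - mean(human)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original sizes.
        h = rng.choices(human, k=len(human))
        m = rng.choices(llm, k=len(llm))
        diffs.append(mean(m) - mean(h))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def within_margin(ci, margin):
    """True when the whole interval lies inside [-margin, +margin]."""
    lo, hi = ci
    return -margin <= lo and hi <= margin
```

A call such as `within_margin(bootstrap_diff_ci(human_ratings, llm_ratings), margin=0.5)` would then report whether the two sets of ratings are equivalent within half a scale point. The margin has to be fixed before looking at the data, and both it and the variable names here are hypothetical.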
Original abstract
Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens' emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs' emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce RAMO, an interactive interface for simulating citizens' emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at https://ramo-chi.ivia.ch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an evaluation framework for assessing LLMs' emotional responses to bureaucratic red tape across cultural contexts. As a pilot study, it applies the framework to a single red-tape scenario, reports that all models exhibit limited alignment with human benchmarks (weaker in Eastern cultures), finds cultural prompting largely ineffective, and introduces the publicly available RAMO interactive interface for simulation and human data collection.
Significance. If validated beyond the pilot, the work could support cost-effective LLM-based simulation of public administration scenarios and highlight current model limitations in cross-cultural emotional modeling. The public RAMO interface is a clear strength for reproducibility and iterative improvement via human data.
major comments (2)
- [Abstract] Abstract and pilot study description: The headline results on limited alignment, weaker Eastern performance, and ineffective cultural prompting derive from a single fixed red-tape scenario. Emotional response patterns may be scenario-dependent (e.g., licensing vs. taxation), so the observed East-West gap and prompting ineffectiveness could be artifacts rather than general properties; additional scenarios or sensitivity checks are required to support the cross-cultural claims.
- [Pilot study] Pilot study section: The abstract states clear empirical findings but provides no details on sample sizes, statistical methods, exact prompting templates, or how emotional responses were quantified and compared against prior human benchmarks. These omissions leave major gaps in assessing whether the data support the alignment conclusions.
minor comments (1)
- [Abstract] The RAMO interface is introduced as a contribution, but the description lacks specifics on its interactive features, data collection workflow, or how it addresses the identified LLM limitations.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and presentation of our pilot study. We have revised the manuscript to improve methodological transparency and to more explicitly qualify the generalizability of results from a single scenario. Point-by-point responses follow.
- Referee: [Abstract] Abstract and pilot study description: The headline results on limited alignment, weaker Eastern performance, and ineffective cultural prompting derive from a single fixed red-tape scenario. Emotional response patterns may be scenario-dependent (e.g., licensing vs. taxation), so the observed East-West gap and prompting ineffectiveness could be artifacts rather than general properties; additional scenarios or sensitivity checks are required to support the cross-cultural claims.
  Authors: We agree that the reported alignment results, including the East-West difference and prompting effects, rest on a single red-tape scenario; the abstract and manuscript already describe the work as a pilot study. The core contribution is the evaluation framework rather than definitive cross-cultural generalizations. In revision we have added an explicit limitations subsection discussing scenario dependence, noting that emotional patterns could vary with other administrative contexts such as taxation. We also performed, and now report, limited sensitivity checks that vary prompt phrasing within the original scenario. Full multi-scenario validation would require new human benchmark data collection, which exceeds the pilot scope; the RAMO interface is provided precisely to support such community-driven extensions.
  Revision: partial
- Referee: [Pilot study] Pilot study section: The abstract states clear empirical findings but provides no details on sample sizes, statistical methods, exact prompting templates, or how emotional responses were quantified and compared against prior human benchmarks. These omissions leave major gaps in assessing whether the data support the alignment conclusions.
  Authors: We acknowledge the lack of sufficient methodological detail in the original pilot study section. The revised manuscript now integrates these elements into the main text: sample sizes drawn from the cited human benchmark studies, the statistical procedures (Pearson correlation and significance testing for alignment scores), the complete prompting templates for each model and cultural condition, and the quantification pipeline (mapping model outputs onto the same valence-arousal and discrete emotion scales used in the human data, including any reliability metrics). These additions were previously only partially available in supplementary materials and are now consolidated for reproducibility.
  Revision: yes
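The rebuttal names Pearson correlation with significance testing as the alignment statistic. A minimal standard-library sketch of that idea is shown below. The permutation test is an assumption made here, since the rebuttal does not say which significance test was used, and the function names and any input vectors are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed |r| (assumed test)."""
    rng = random.Random(seed)  # seeded for reproducibility
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Count shuffles whose |r| is at least as extreme as the observed one.
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

Here `x` would hold the human mean ratings per emotion item and `y` the corresponding model ratings for one cultural condition. With only a handful of emotion items the permutation p-value is coarse, which is one reason the referee's request for sample sizes matters.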
Circularity Check
No circularity: empirical comparison to external human benchmarks
The paper performs a direct empirical evaluation of LLM agent outputs against independent human-subject data from prior studies, using a single fixed red-tape vignette as a pilot. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the reported chain. The central claims (limited alignment, weaker Eastern performance, ineffective prompting) are computed from external ground-truth benchmarks rather than reducing to quantities defined inside the paper itself. The single-scenario limitation raises questions of generalizability but does not create circularity by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM outputs can be interpreted as emotional responses comparable to human self-reports
- domain assumption: Prior human-subject studies provide accurate cross-cultural benchmarks
invented entities (1)
- RAMO interactive interface: no independent evidence