arxiv: 2604.12545 · v1 · submitted 2026-04-14 · 💻 cs.AI · cs.CY

Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents

Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

keywords: LLM agents · emotional responses · red tape · cross-cultural differences · simulation · public administration · cultural prompting · evaluation framework

The pith

LLM agents show limited alignment with human emotional responses to bureaucratic red tape, performing worse in Eastern cultures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can stand in for real citizens when measuring emotional reactions to red tape in policy settings. Prior human studies found clear cultural differences, so the authors build a framework that applies the same red-tape story and the same emotion scales to several LLMs under different cultural prompts. The results indicate that none of the models match human patterns well and that adding culture-specific instructions brings little improvement. This matters because public-administration researchers hope to use cheap LLM simulations instead of repeated human-subject experiments. The work also releases an open interface called RAMO for running such simulations and gathering new human data.

Core claim

When the same single red-tape scenario and emotion-rating instruments used in earlier human studies are given to current LLMs, all models produce responses that align only modestly with the human benchmarks, with noticeably lower agreement for Eastern cultural contexts; standard cultural-prompting techniques do not meaningfully close that gap.

What carries the argument

An evaluation framework that feeds identical red-tape vignettes and Likert-style emotion scales to LLMs under varied cultural prompts and then compares the output distributions against published human-subject baselines.
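
The comparison step of such a framework can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt template, the function names `build_prompt` and `mean_absolute_gap`, the 1-5 scale, and the mean-absolute-gap metric are all hypothetical stand-ins chosen for clarity.

```python
from statistics import mean
from typing import Dict, List, Optional


def build_prompt(vignette: str, culture: Optional[str], emotions: List[str]) -> str:
    """Assemble one survey-style prompt.

    Hypothetical template: the paper's actual prompts are not reproduced here.
    Passing culture=None yields the condition without cultural prompting.
    """
    persona = f"You are a citizen living in {culture}. " if culture else ""
    scale = ", ".join(emotions)
    return (
        f"{persona}Read the scenario and rate how strongly you feel each "
        f"emotion on a 1-5 scale ({scale}).\n\nScenario: {vignette}"
    )


def mean_absolute_gap(
    llm_ratings: Dict[str, List[int]], human_means: Dict[str, float]
) -> float:
    """Average |mean LLM rating - human mean| over the emotions both sides share.

    Assumed metric for illustration; smaller values mean closer alignment.
    """
    shared = sorted(set(llm_ratings) & set(human_means))
    if not shared:
        raise ValueError("no shared emotions to compare")
    return mean(abs(mean(llm_ratings[e]) - human_means[e]) for e in shared)
```

Calling `mean_absolute_gap` once per cultural condition would give one gap per culture, which is one simple way to make a cross-cultural difference in alignment visible.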

If this is right

  • Public-administration experiments that rely on LLM agents for cross-cultural emotion data will need independent validation before policy conclusions are drawn.
  • Cultural prompting alone is unlikely to be a sufficient fix for alignment gaps in this domain.
  • Collecting fresh human ratings through the released RAMO interface becomes necessary to retrain or calibrate future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the alignment gap persists across additional scenarios, researchers may need entirely new training objectives rather than prompt engineering for culturally sensitive social simulation.
  • The same evaluation framework could be reused to test whether newer models close the Eastern-culture gap without changes to the underlying scenario.
  • Policymakers considering LLM-based citizen feedback tools should treat Eastern-culture results as especially provisional.

Load-bearing premise

A single red-tape scenario plus the chosen emotion metrics can stand in for the broader range of cross-cultural differences citizens feel toward bureaucratic procedures.

What would settle it

A new human-subject study using the identical scenario, scales, and cultural groups that finds LLM outputs statistically indistinguishable from the fresh human ratings.
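
One way to operationalise "statistically indistinguishable" is a two-sample permutation test on the mean rating of a single emotion. The sketch below is an assumption about how such a check might look, not a procedure taken from the paper; `permutation_p_value` and its parameters are hypothetical names. A large p-value only means no difference was detected, so a formal equivalence test would still be needed to claim that the two samples match.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 10_000, seed: int = 0
) -> float:
    """Two-sided permutation test for a difference in means.

    a, b: ratings from the two groups (for example LLM outputs and human ratings).
    Returns the share of label shuffles whose absolute mean difference is at
    least as large as the observed one, with the usual +1 correction.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)
```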

Figures

Figures reproduced from arXiv: 2604.12545 by Jiugeng Sun, Mennatallah El-Assady, Wanchun Ni, Yixian Liu.

Figure 1: Overview of RAMO interface design, detailed in Section 5. Major components are enlarged for visibility and labelled.
Figure 2: Cultural factor scores of Mainland China, Germany and Hong Kong SAR.

Abstract

Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens' emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs' emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce RAMO, an interactive interface for simulating citizens' emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at https://ramo-chi.ivia.ch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an evaluation framework for assessing LLMs' emotional responses to bureaucratic red tape across cultural contexts. As a pilot study, it applies the framework to a single red-tape scenario, reports that all models exhibit limited alignment with human benchmarks (weaker in Eastern cultures), finds cultural prompting largely ineffective, and introduces the publicly available RAMO interactive interface for simulation and human data collection.

Significance. If validated beyond the pilot, the work could support cost-effective LLM-based simulation of public administration scenarios and highlight current model limitations in cross-cultural emotional modeling. The public RAMO interface is a clear strength for reproducibility and iterative improvement via human data.

major comments (2)
  1. [Abstract] Abstract and pilot study description: The headline results on limited alignment, weaker Eastern performance, and ineffective cultural prompting derive from a single fixed red-tape scenario. Emotional response patterns may be scenario-dependent (e.g., licensing vs. taxation), so the observed East-West gap and prompting ineffectiveness could be artifacts rather than general properties; additional scenarios or sensitivity checks are required to support the cross-cultural claims.
  2. [Pilot study] Pilot study section: The abstract states clear empirical findings but provides no details on sample sizes, statistical methods, exact prompting templates, or how emotional responses were quantified and compared against prior human benchmarks. These omissions leave major gaps in assessing whether the data support the alignment conclusions.
minor comments (1)
  1. [Abstract] The RAMO interface is introduced as a contribution, but the description lacks specifics on its interactive features, data collection workflow, or how it addresses the identified LLM limitations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our pilot study. We have revised the manuscript to improve methodological transparency and to more explicitly qualify the generalizability of results from a single scenario. Point-by-point responses follow.

  1. Referee: [Abstract] Abstract and pilot study description: The headline results on limited alignment, weaker Eastern performance, and ineffective cultural prompting derive from a single fixed red-tape scenario. Emotional response patterns may be scenario-dependent (e.g., licensing vs. taxation), so the observed East-West gap and prompting ineffectiveness could be artifacts rather than general properties; additional scenarios or sensitivity checks are required to support the cross-cultural claims.

    Authors: We agree that the reported alignment results, including the East-West difference and the prompting effects, rest on a single red-tape scenario; the abstract and manuscript already describe the work as a pilot study. The core contribution is the evaluation framework rather than definitive cross-cultural generalizations. In the revision we have added an explicit limitations subsection on scenario dependence, noting that emotional patterns could vary with other administrative contexts such as taxation. We also ran, and now report, limited sensitivity checks that vary prompt phrasing within the original scenario. Full multi-scenario validation would require collecting new human benchmark data, which exceeds the pilot scope; the RAMO interface is provided precisely to support such community-driven extensions.

    revision: partial

  2. Referee: [Pilot study] Pilot study section: The abstract states clear empirical findings but provides no details on sample sizes, statistical methods, exact prompting templates, or how emotional responses were quantified and compared against prior human benchmarks. These omissions leave major gaps in assessing whether the data support the alignment conclusions.

    Authors: We acknowledge the lack of sufficient methodological detail in the original pilot study section. The revised manuscript now integrates these elements into the main text: sample sizes drawn from the cited human benchmark studies, the statistical procedures (Pearson correlation and significance testing for alignment scores), the complete prompting templates for each model and cultural condition, and the quantification pipeline (mapping model outputs onto the same valence-arousal and discrete emotion scales used in the human data, including any reliability metrics). These additions were previously only partially available in supplementary materials and are now consolidated for reproducibility.

    revision: yes
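
For readers unfamiliar with the alignment statistic named in this response, Pearson correlation between two equally long lists of scores can be computed as follows. This is a generic textbook sketch, not the authors' pipeline: the function name `pearson_r` is hypothetical, and pairing per-emotion model means with per-emotion human means is an assumption, since the response does not specify the exact inputs.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between paired scores x and y.

    Assumed usage: x holds model mean ratings per emotion, y holds the
    matching human mean ratings, in the same order.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    # Sum of cross-deviations and of squared deviations around each mean.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

A coefficient near 1 would indicate that the model ranks emotions much as humans do, although it says nothing about whether the absolute rating levels agree.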

Circularity Check

0 steps flagged

No circularity: empirical comparison to external human benchmarks


The paper performs a direct empirical evaluation of LLM agent outputs against independent human-subject data from prior studies, using a single fixed red-tape vignette as a pilot. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the reported chain. The central claims (limited alignment, weaker Eastern performance, ineffective prompting) are computed from external ground-truth benchmarks rather than reducing to quantities defined inside the paper itself. The single-scenario limitation raises questions of generalizability but does not create circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that LLM-generated text can be meaningfully scored for emotional content and compared to human benchmarks, plus the validity of the single-scenario pilot design.

axioms (2)
  • domain assumption: LLM outputs can be interpreted as emotional responses comparable to human self-reports
    Invoked when claiming limited alignment between models and human data
  • domain assumption: Prior human-subject studies provide accurate cross-cultural benchmarks
    Used as ground truth for evaluating LLM performance
invented entities (1)
  • RAMO interactive interface (no independent evidence)
    purpose: Tool for simulating citizen emotional responses and collecting new human data
    New system introduced to address the identified gap

pith-pipeline@v0.9.0 · 5467 in / 1454 out tokens · 58135 ms · 2026-05-10T15:17:45.727883+00:00 · methodology
