Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
Pith reviewed 2026-05-10 09:36 UTC · model grok-4.3
The pith
AI transparency influences real human interactions more than personality traits, unlike in simulations where both factors weigh similarly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Causal discovery on the paired datasets shows that personality traits and AI attributes exert comparable influence on interaction quality in the simulated imperfectly cooperative scenarios. In the parallel human-subject experiments, however, AI attributes, particularly chain-of-thought transparency, become markedly more impactful, with further variation between the hiring and transaction scenario categories.
What carries the argument
Causal discovery analysis that jointly models scenario outcomes, communication traces, and questionnaire measures to isolate relative causal contributions of Extraversion, Agreeableness, Adaptability, Expertise, and Transparency.
If this is right
- AI agents intended for real imperfectly cooperative settings should prioritize explicit reasoning transparency over other design attributes.
- Personality-trait matching or adaptation may deliver smaller returns in actual use than simulation results suggest.
- Separate design guidelines are needed for negotiation versus transactional contexts because the two scenario types produce distinct causal patterns.
- Evaluations limited to task performance miss the communication and relational effects captured by the integrated analysis.
- Development pipelines that rely exclusively on simulation risk over-weighting features whose impact shrinks when humans are involved.
Where Pith is reading between the lines
- The simulation-reality gap suggests that transparency mechanisms should be prototyped and measured directly with users rather than tuned inside simulated environments.
- Similar mismatches may appear in other high-stakes domains such as medical or legal assistance, where users must decide how much to trust an AI whose goals are only partially aligned.
- One testable extension is whether making specific internal states visible (for example, confidence estimates or alternative options considered) produces measurable gains in cooperation metrics.
- The work points toward a broader need for hybrid evaluation protocols that treat human-subject data as the primary source for causal claims about user-facing AI.
Load-bearing premise
The causal discovery procedure correctly attributes outcome differences to measured personality traits versus AI attributes without unmeasured confounders or systematic differences between how simulated agents and real humans form decisions.
What would settle it
If repeating the human experiment with a new participant pool, or applying a different causal inference technique, showed no reliable difference in the relative strength of transparency versus personality, the reported divergence would be undermined.
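The replication test sketched above can be made concrete as a permutation test on synthetic stand-in data. Everything here is hypothetical: the two-predictor setup, the effect sizes, and the use of standardized regression coefficients as a stand-in for the paper's causal-strength estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_strength(X, y):
    """|std. coefficient of "transparency"| - |std. coefficient of "extraversion"|
    from an OLS fit of outcome on the two predictors (columns 0 and 1)."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return abs(b[0]) - abs(b[1])

# Hypothetical data: in the "human" set transparency dominates,
# in the "simulated" set both attributes matter about equally.
n = 300
X_sim = rng.normal(size=(n, 2))
y_sim = 0.5 * X_sim[:, 0] + 0.5 * X_sim[:, 1] + rng.normal(size=n)
X_hum = rng.normal(size=(n, 2))
y_hum = 0.8 * X_hum[:, 0] + 0.2 * X_hum[:, 1] + rng.normal(size=n)

observed = relative_strength(X_hum, y_hum) - relative_strength(X_sim, y_sim)

# Permutation null: shuffle which rows count as "human" vs "simulated".
X_all = np.vstack([X_sim, X_hum])
y_all = np.concatenate([y_sim, y_hum])
null = []
for _ in range(2000):
    idx = rng.permutation(2 * n)
    a, b = idx[:n], idx[n:]
    null.append(relative_strength(X_all[a], y_all[a])
                - relative_strength(X_all[b], y_all[b]))
p = (np.sum(np.abs(null) >= abs(observed)) + 1) / (len(null) + 1)
print(f"observed gap = {observed:.3f}, permutation p = {p:.4f}")
```

A small permutation p-value under a replication of this form would support the divergence; a null result with adequate power would undermine it.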
Original abstract
AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 simulations and a parallel human subjects experiment involving 290 human participants to investigate these effects across two scenario categories: (1) hiring negotiations between human job candidates and AI hiring agents; and (2) human-AI transactions wherein AI agents may conceal information to maximize internal goals. We examine user Extraversion and Agreeableness alongside AI design characteristics, including Adaptability, Expertise, and chain-of-thought Transparency. Our causal discovery analysis extends performance-focused evaluations by integrating scenario-based outcomes, communication analysis, and questionnaire measures. Results reveal divergences between purely simulated and human study datasets, and between scenario types. In simulation experiments, personality traits and AI attributes were comparatively influential. Yet, with actual human subjects, AI attributes -- particularly transparency -- were much more impactful. We discuss how these divergences vary across different interaction contexts, offering crucial insights for the future of human-centered AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares a simulated dataset of 2,000 human-AI interactions with a human subjects experiment involving 290 participants in two imperfectly cooperative scenarios: hiring negotiations and transactions with potential information concealment. Using causal discovery, it examines the relative impacts of human personality traits (Extraversion and Agreeableness) and AI design attributes (Adaptability, Expertise, and Transparency), reporting divergences where simulations show balanced influence but human data highlight AI attributes, particularly transparency, as more impactful.
Significance. If the reported divergences hold under rigorous validation, the work is significant for highlighting the gap between simulated and real human-AI interactions in non-fully cooperative settings. It suggests that AI transparency plays an outsized role in actual user experiences, informing the design of AI agents that better account for human factors in imperfect cooperation. The integration of scenario outcomes, communication analysis, and questionnaires extends beyond performance metrics.
major comments (3)
- Abstract: The abstract states the main finding but supplies no details on statistical controls, exclusion criteria, effect sizes, or how causal discovery was validated, making it impossible to judge whether the reported divergences are supported by the data.
- Results (causal discovery analysis): The analysis does not specify the algorithm used (e.g., PC, GES, NOTEARS) or provide sensitivity checks for unmeasured confounders such as trust, risk aversion, or prior AI exposure, which could open back-door paths and undermine the claim that AI attributes are more impactful in human data compared to simulations.
- Methods: There is no explicit comparison of conditional independence structures between the simulated and human datasets, which is necessary to ensure that differences reflect attribute effects rather than mismatches in decision processes.
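The referee's request to name the algorithm can be illustrated with a minimal sketch of the linear NOTEARS idea, one candidate in the family listed above: fit a linear structural model by gradient descent while the smooth penalty h(W) = tr(exp(W ∘ W)) − d drives the weighted adjacency matrix toward acyclicity. This is a toy penalty-method sketch on invented variables, not any implementation from the paper.

```python
import numpy as np
from scipy.linalg import expm

def notears_linear(X, lam=0.05, rho=10.0, lr=0.01, steps=2000):
    """Toy linear NOTEARS: gradient descent on
    0.5/n * ||X - X W||_F^2 + lam * ||W||_1 + rho * h(W),
    where h(W) = tr(exp(W * W)) - d is the smooth acyclicity penalty."""
    n, d = X.shape
    W = np.zeros((d, d))
    for _ in range(steps):
        E = expm(W * W)                      # matrix exponential of the Hadamard square
        grad_h = 2 * E.T * W                 # gradient of the acyclicity term
        grad_loss = -X.T @ (X - X @ W) / n   # gradient of the least-squares loss
        W -= lr * (grad_loss + lam * np.sign(W) + rho * grad_h)
        np.fill_diagonal(W, 0)               # no self-loops
    W[np.abs(W) < 0.2] = 0.0                 # prune weak edges
    return W

# Invented toy data: two attributes driving an interaction outcome.
rng = np.random.default_rng(1)
n = 500
transparency = rng.normal(size=n)
extraversion = rng.normal(size=n)
outcome = 0.9 * transparency + 0.4 * extraversion + 0.3 * rng.normal(size=n)
X = np.column_stack([transparency, extraversion, outcome])
W = notears_linear(X - X.mean(0))
print(np.round(W, 2))
```

A real analysis would use an augmented Lagrangian schedule rather than a fixed penalty weight, handle mixed continuous and categorical variables, and validate the recovered graph, which is exactly the level of detail the referee asks the authors to report.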
minor comments (1)
- Abstract: Consider adding a brief mention of the sample sizes (2,000 simulations and 290 participants) for immediate context on scale.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments, which have helped us identify areas for improvement in clarity and rigor. We address each major comment point by point below and have revised the manuscript accordingly to strengthen the presentation of our methods, results, and abstract.
Point-by-point responses
-
Referee: Abstract: The abstract states the main finding but supplies no details on statistical controls, exclusion criteria, effect sizes, or how causal discovery was validated, making it impossible to judge whether the reported divergences are supported by the data.
Authors: We agree that the abstract would benefit from greater specificity to allow readers to assess the findings more readily. In the revised version, we have expanded the abstract to reference the sample sizes (2,000 simulations and 290 participants), note the use of causal discovery with validation via sensitivity analyses, and indicate key effect size patterns. Full details on exclusion criteria (attention checks and incomplete responses), statistical controls, and causal discovery validation procedures remain in the Methods and Results sections, but we now include a brief reference to them in the abstract to improve accessibility without exceeding length constraints. revision: yes
-
Referee: Results (causal discovery analysis): The analysis does not specify the algorithm used (e.g., PC, GES, NOTEARS) or provide sensitivity checks for unmeasured confounders such as trust, risk aversion, or prior AI exposure, which could open back-door paths and undermine the claim that AI attributes are more impactful in human data compared to simulations.
Authors: We appreciate this observation and have revised the Results section to explicitly name the causal discovery algorithm employed (NOTEARS, chosen for its suitability with mixed continuous and categorical variables in our interaction data). We have also added sensitivity analyses that incorporate available proxy measures for potential unmeasured confounders, including trust (from post-interaction questionnaires), risk aversion (from validated scales), and prior AI exposure (from demographic items). These checks demonstrate that the core finding—greater impact of AI attributes, especially transparency, in human data—remains stable. We acknowledge that not all possible confounders can be fully ruled out and discuss this limitation in the revised text. revision: yes
-
Referee: Methods: There is no explicit comparison of conditional independence structures between the simulated and human datasets, which is necessary to ensure that differences reflect attribute effects rather than mismatches in decision processes.
Authors: We concur that such a comparison strengthens the validity of cross-dataset inferences. The revised Methods and Results sections now include an explicit side-by-side analysis of the conditional independence structures discovered in the simulated versus human datasets. This takes the form of a supplementary table listing shared and divergent independence relations, accompanied by discussion showing that the primary divergences in causal influence (e.g., transparency dominance in human data) align with attribute effects rather than fundamental differences in underlying decision processes. This addition helps substantiate the reported simulation-human gaps. revision: yes
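The conditional-independence comparison proposed in this response could be sketched with Fisher-z partial-correlation tests run on each dataset, then set-differenced to list shared and divergent independence relations. The data, variable roles, and restriction to single-variable conditioning sets below are all hypothetical simplifications.

```python
import numpy as np
from itertools import combinations
from scipy import stats

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given columns `cond`,
    from the inverse covariance of the relevant submatrix."""
    idx = [i, j] + list(cond)
    P = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def ci_relations(data, alpha=0.01):
    """Set of (i, j, cond) triples judged conditionally independent by a
    Fisher-z test, over all pairs and single-variable conditioning sets."""
    n, d = data.shape
    out = set()
    for i, j in combinations(range(d), 2):
        for cond in [()] + [(k,) for k in range(d) if k not in (i, j)]:
            r = partial_corr(data, i, j, cond)
            z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(cond) - 3)
            if 2 * (1 - stats.norm.cdf(abs(z))) > alpha:
                out.add((i, j, cond))
    return out

# Hypothetical stand-ins for the simulated and human datasets:
# columns are transparency (0), extraversion (1), outcome (2).
rng = np.random.default_rng(2)
def make(n, w_trans, w_extra):
    t, e = rng.normal(size=n), rng.normal(size=n)
    o = w_trans * t + w_extra * e + rng.normal(size=n)
    return np.column_stack([t, e, o])

sim, hum = make(2000, 0.5, 0.5), make(290, 0.8, 0.1)
ci_sim, ci_hum = ci_relations(sim), ci_relations(hum)
shared = ci_sim & ci_hum
divergent = ci_sim ^ ci_hum
print("shared:", sorted(shared))
print("divergent:", sorted(divergent))
```

A supplementary table of exactly this form, shared versus divergent relations, is what the revised Methods section describes.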
Circularity Check
No circularity: purely empirical comparison of simulated and human datasets
full rationale
The paper performs an empirical study comparing a 2,000-simulation dataset against data from 290 human participants across two scenario types, using causal discovery to assess the relative impacts of personality traits (Extraversion, Agreeableness) and AI attributes (Adaptability, Expertise, Transparency). No derivations, equations, or parameter fittings are presented as independent predictions; results follow directly from applying standard causal discovery methods to the collected measures and outcomes. The work contains no self-definitional constructs, fitted-input-as-prediction steps, or load-bearing self-citations that would reduce the central claims to tautologies. The analysis rests on direct comparison of the two datasets rather than on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Causal discovery algorithms can recover meaningful causal structure from the collected simulation and questionnaire data.
Reference graph
Works this paper leans on
-
[1]
William Agnew, A. Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, and Kevin R. McKee. 2024. https://doi.org/10.1145/3613904.3642703 The illusion of artificial inclusion. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI '24, pages 1–12. ACM
-
[2]
Evgeni Aizenberg, Matthew J Dennis, and Jeroen van den Hoven. 2025. Examining the assumptions of ai hiring assessments and their impact on job seekers’ autonomy over self-representation. AI & society, 40(2):919--927
-
[3]
Marie-Noelle Albert and Salah Koubaa. 2025. https://doi.org/10.3389/fhumd.2025.1554731 The coopetition of human intelligence and artificial intelligence through the prism of irrationality. Frontiers in Human Dynamics, 7. Publisher: Frontiers
-
[4]
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee. 2021. https://doi.org/10.1007/978-3-030-67670-4_26 A Deep Dive into Multilingual Hate Speech Classification. In Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track, pages 423--439, Cham. Springer International Publishing
-
[5]
Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. https://doi.org/10.1017/pan.2023.2 Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351
-
[6]
Tita A. Bach, Jenny K. Kristiansen, Aleksandar Babic, and Alon Jacovi. 2024. https://doi.org/10.1109/ACCESS.2024.3437190 Unpacking Human-AI Interaction in Safety-Critical Industries: A Systematic Literature Review. IEEE Access, 12:106385--106414
- [7]
-
[8]
Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S Weld, Walter S Lasecki, and Eric Horvitz. 2019. Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff. Proc. Conf. AAAI Artif. Intell., 33(01):2429--2437
-
[9]
Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. 2020. https://doi.org/10.1016/j.inffus.2019.12.012 Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and ...
-
[10]
Paul Beaumont, Ben Horsburgh, Philip Pilgerstorfer, Angel Droth, Richard Oentaryo, Steven Ler, Hiep Nguyen, Gabriel Azevedo Ferreira, Zain Patel, and Wesley Leong. 2021. CausalNex
-
[11]
Graham Caron and Shashank Srivastava. 2022. https://doi.org/10.48550/arXiv.2212.10276 Identifying and Manipulating the Personality Traits of Language Models. Preprint, arXiv:2212.10276
-
[12]
Erin K. Chiou and John D. Lee. 2023. https://doi.org/10.1177/00187208211009995 Trusting Automation: Designing for Responsivity and Resilience. Human Factors: The Journal of the Human Factors and Ergonomics Society, 65(1):137--165
-
[13]
Nazli Cila. 2022. Designing human-agent collaborations: Commitment, responsiveness, and support. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1--18
-
[14]
Myke C. Cohen, Matthew A. Peel, Matthew J. Scalia, Matthew M. Willett, Erin K. Chiou, Jamie C. Gorman, and Nancy J. Cooke. 2023. https://doi.org/10.1177/21695067231196240 Anthropomorphism Moderates the Relationships of Dispositional, Perceptual, and Behavioral Trust in a Robot Teammate. In Proceedings of the Human Factors and Ergonomics Society Annual ...
-
[15]
Myke C. Cohen, Zhe Su, Hsien-Te Kao, Daniel Nguyen, Spencer Lynch, Maarten Sap, and Svitlana Volkova. 2025. https://doi.org/10.48550/arXiv.2506.15928 Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues. arXiv preprint. ArXiv:2506.15928 [cs]
-
[16]
Nancy J. Cooke, Myke C. Cohen, Walter C. Fazio, Laura H. Inderberg, Craig J. Johnson, Glenn J. Lematta, Matthew Peel, and Aaron Teo. 2024. https://doi.org/10.1177/00187208231162449 From Teams to Teamness: Future Directions in the Science of Team Cognition. Human Factors: The Journal of the Human Factors and Ergonomics Society, 66(6):1669--1680
-
[17]
Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, and Ziran Wang. 2023. https://doi.org/10.1145/3583740.3626806 Human-Autonomy Teaming on Autonomous Vehicles with Large Language Model-Enabled Human Digital Twins. In 2023 IEEE/ACM Symposium on Edge Computing (SEC), pages 319--324
-
[18]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.48550/arXiv.1810.04805 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint, arXiv:1810.04805
-
[19]
Danica Dillion, Niket Tandon, Yuling Gu, and Kurt Gray. 2023. https://api.semanticscholar.org/CorpusID:258569852 Can ai language models replace human participants? Trends in Cognitive Sciences, 27:597--600
-
[20]
Yifan Duan, Yihong Tang, Xuefeng Bai, Kehai Chen, Juntao Li, and Min Zhang. 2025. https://doi.org/10.48550/arXiv.2502.20859 The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents. Preprint, arXiv:2502.20859
-
[21]
Mica R Endsley. 2023. Supporting human-ai teams: Transparency, explainability, and situation awareness. Computers in Human Behavior, 140:107574
- [22]
- [23]
-
[24]
Ivar Frisch and Mario Giulianelli. 2024. https://doi.org/10.48550/arXiv.2402.02896 LLM Agents in Interaction: Measuring Personality Consistency and Linguistic Alignment in Interacting Populations of Large Language Models. Preprint, arXiv:2402.02896
-
[25]
Justin Garten, Reihane Boghrati, Joe Hoover, Kate M Johnson, and Morteza Dehghani. 2016. Morality between the lines: Detecting moral sentiment in text. In Proceedings of IJCAI 2016 Workshop on Computational Modeling of Attitudes
-
[26]
P. A. Hancock, Theresa T. Kessler, Alexandra D. Kaplan, Kimberly Stowers, J. Christopher Brill, Deborah R. Billings, Kristin E. Schaefer, and James L. Szalma. 2023. https://doi.org/10.3389/fpsyg.2023.1081086 How and why humans trust: A meta-analysis and elaborated model. Frontiers in Psychology, 14
-
[27]
Peter A. Hancock, Deborah R. Billings, Kristin E. Schaefer, Jessie Y. C. Chen, Ewart J. de Visser, and Raja Parasuraman. 2011. https://doi.org/10.1177/0018720811417254 A Meta-Analysis of Factors Affecting Trust in Human-Robot Interaction. Human Factors: The Journal of the Human Factors and Ergonomics Society, 53(5):517--527
-
[28]
Laura Hanu and Unitary team. 2020. https://doi.org/10.5281/zenodo.7925667 Detoxify
-
[29]
Thomas F Heston and Justin Gillette. 2025. https://doi.org/10.7759/cureus.84706 Large Language Models Demonstrate Distinct Personality Profiles. Cureus
-
[30]
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. https://arxiv.org/abs/2308.00352 Metagpt: Meta programming for a multi-agent collaborative framework. Preprint, arXiv:2308.00352
-
[31]
Yin Jou Huang and Rafik Hadfi. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.605 How Personality Traits Influence Negotiation Outcomes? A Simulation based on Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10336--10351, Miami, Florida, USA. Association for Computational Linguistics
-
[32]
Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, and Graham Neubig. 2025. https://doi.org/10.18653/v1/2025.naacl-demo.17 Cowpilot: A framework for autonomous and human-agent collaborative web navigation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational L...
-
[33]
Sai Mounika Inavolu. 2024. Exploring ai-driven customer service: Evolution, architectures, opportunities, challenges and future directions. International Journal of Engineering and Advanced Technology, 13(3):156--163
-
[34]
Oliver P John, Eileen M Donahue, and Robert L Kentle. 1991. Big five inventory. Journal of personality and social psychology
-
[35]
Michael Knop, Sebastian Weber, Marius Mueller, and Bjoern Niehaves. 2022. https://doi.org/10.2196/28639 Human Factors and Technological Characteristics Influencing the Interaction of Medical Professionals With Artificial Intelligence-Enabled Clinical Decision Support Systems: Literature Review. JMIR Human Factors, 9(1):e28639
-
[36]
John D. Lee and Katrina A. See. 2004. https://doi.org/10.1518/hfes.46.1.50_30392 Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1):50--80
-
[37]
Young-Jun Lee, Chae-Gyun Lim, and Ho-Jin Choi. 2022. Does GPT-3 Generate Empathetic Dialogues? A Novel In-Context Example Selection Method and Automatic Evaluation Metric for Empathetic Dialogue Generation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 669--683, Gyeongju, Republic of Korea. International Commit...
-
[38]
Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, and Zhifang Sui. 2025. https://doi.org/10.18653/v1/2025.findings-acl.813 How far are LLMs from being our digital twins? A benchmark for persona-based behavior chain simulation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 15738--15763, Vienna, Austria. A...
-
[39]
Claire Liang, Julia Proft, Erik Andersen, and Ross A. Knepper. 2019. https://doi.org/10.1145/3290605.3300325 Implicit communication of actionable information in human-ai teams. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, pages 1–13, New York, NY, USA. Association for Computing Machinery
-
[40]
Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen Wang, Qi He, and Dakuo Wang. 2025. https://arxiv.org/abs/2503.20749 Prompting is not all you need! Evaluating llm agent simulation methodologies with real-world online customer behavior data. Preprint, arXiv:2503.20749
-
[41]
Robert C. MacCallum, Shaobo Zhang, Kristopher J. Preacher, and Derek D. Rucker. 2002. https://doi.org/10.1037/1082-989x.7.1.19 On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1):19--40
-
[42]
Robert R McCrae and Oliver P John. 1992. An introduction to the five-factor model and its applications. Journal of personality, 60(2):175--215
-
[43]
Daniel Nguyen, Myke C. Cohen, Hsien-Te Kao, Grant Engberson, Louis Penafiel, Spencer Lynch, Robert McCormack, Laura Cassani, and Svitlana Volkova. 2025. https://doi.org/10.1075/is.24052.ngu Exploratory models of human-AI teams: Leveraging human digital twins to investigate trust development. Interaction Studies, 26(2):267--297
-
[44]
Nicolas Spatola. 2025. https://doi.org/10.1016/j.chbah.2025.100169 To Be competitive or not to be competitive: How performance goals shape human-AI and human-human collaboration. Computers in Human Behavior: Artificial Humans, 5:100169
- [45]
- [46]
-
[47]
Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2022. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1--18
-
[48]
Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S. Bernstein. 2024. https://arxiv.org/abs/2411.10109 Generative agent simulations of 1,000 people. Preprint, arXiv:2411.10109
-
[49]
Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect, 1st edition. Basic Books, New York
-
[50]
Nikolay B. Petrov, Gregory Serapio-García, and Jason Rentfrow. 2024. https://doi.org/10.48550/arXiv.2405.07248 Limited Ability of LLMs to Simulate Human Psychological Behaviours: A Psychometric Analysis. Preprint, arXiv:2405.07248
-
[51]
Muhammad Raees, Inge Meijerink, Ioanna Lykourentzou, Vassilis-Javed Khan, and Konstantinos Papangelis. 2024. https://doi.org/10.1016/j.ijhcs.2024.103301 From explainable to interactive AI: A literature review on current trends in human-AI interaction. International Journal of Human-Computer Studies, 189:103301
-
[52]
Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. https://doi.org/10.18653/v1/D17-1317 Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931--2937, Copenhagen, Denmark. Association for Co...
-
[53]
Hannah Rashkin, Sameer Singh, and Yejin Choi. 2016. https://doi.org/10.48550/arXiv.1506.02739 Connotation Frames: A Data-Driven Investigation. Preprint, arXiv:1506.02739
-
[54]
Bhadresh Savani. 2024. DistilBERT for emotion recognition
-
[55]
William R. Shadish, Thomas D. Cook, and Donald T. Campbell. 2001. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston
- [56]
-
[57]
Ashish Sharma, Sudha Rao, Chris Brockett, Akanksha Malhotra, Nebojsa Jojic, and Bill Dolan. 2024. https://doi.org/10.18653/v1/2024.eacl-long.119 Investigating agency of LLMs in human-AI collaboration tasks. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1968-...
-
[58]
Ben Shneiderman. 2020. https://doi.org/10.1080/10447318.2020.1741118 Human-Centered Artificial Intelligence: Reliable, Safe & Trustworthy. International Journal of Human–Computer Interaction, 36(6):495--504. Publisher: Taylor & Francis
-
[59]
Zhe Su, Xuhui Zhou, Sanketh Rangreji, Anubha Kabra, Julia Mendelsohn, Faeze Brahman, and Maarten Sap. 2025. https://doi.org/10.18653/v1/2025.naacl-long.595 AI-LieDar: Examine the trade-off between utility and truthfulness in LLM agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational...
-
[60]
Qiyang Sun, Yupei Li, Emran Alturki, Sunil Munthumoduku Krishna Murthy, and Björn W. Schuller. 2024. https://doi.org/10.48550/arXiv.2412.15114 Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment. arXiv preprint. ArXiv:2412.15114 [cs]
-
[61]
S Volkova, M Glenski, E Ayton, E Saldanha, J Mendoza, D Arendt, Z Shaw, K Cronk, S Smith, and M Greaves. 2021. https://arxiv.org/abs/27036529 Machine Intelligence to Detect, Characterise, and Defend against Influence Operations in the Information Environment. Journal of Information Warfare, 20(2):42--66
-
[62]
Svitlana Volkova, Dustin Arendt, Emily Saldanha, Maria Glenski, Ellyn Ayton, Joseph Cottam, Sinan Aksoy, Brett Jefferson, and Karthnik Shrivaram. 2023. https://doi.org/10.1007/s10588-021-09351-y Explaining and predicting human behavior and social dynamics in simulated virtual worlds: Reproducibility, generalizability, and robustness of causal discovery me...
-
[63]
Svitlana Volkova, Daniel Nguyen, Louis Penafiel, Hsien-Te Kao, Myke Cohen, Grant Engberson, Laura Cassani, Mohammed Almutairi, Charles Chiang, Nandini Banerjee, Matthew Belcher, Trenton W. Ford, Michael G. Yankoski, Tim Weninger, Diego Gomez-Zara, and Summer Rebensky. 2025. https://doi.org/10.1007/978-3-031-92970-0_20 VirTLab: Augmented Intelligence for ...
-
[64]
Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Shiyang Lai, Kai Shu, Jindong Gu, Adel Bibi, Ziniu Hu, David Jurgens, James Evans, Philip Torr, Bernard Ghanem, and Guohao Li. 2024. https://arxiv.org/abs/2402.04559 Can large language model agents simulate human trust behavior? Preprint, arXiv:2402.04559
-
[65]
Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. 2025. https://arxiv.org/abs/2503.16416 Survey on evaluation of llm-based agents . Preprint, arXiv:2503.16416
-
[66]
Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. 2024a. https://arxiv.org/abs/2307.02485 Building cooperative embodied agents modularly with large language models. Preprint, arXiv:2307.02485
-
[67]
Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, and Esin Durmus. 2025. https://doi.org/10.48550/arXiv.2510.07686 Stress-Testing Model Specs Reveals Character Differences among Language Models. Preprint, arXiv:2510.07686
- [68]
-
[69]
Xuhui Zhou, Zhe Su, Sophie Feng, Jiaxu Zhou, Jen-tse Huang, Hsien-Te Kao, Spencer Lynch, Svitlana Volkova, Tongshuang Wu, Anita Woolley, Hao Zhu, and Maarten Sap. 2025. https://doi.org/10.18653/v1/2025.naacl-demo.30 SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation. In Proceedings of the 2025 Conference of ...
-
[70]
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. 2024. https://doi.org/10.48550/arXiv.2310.11667 SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents. Preprint, arXiv:2310.11667