pith. machine review for the scientific record.

arxiv: 2511.13979 · v2 · submitted 2025-11-17 · 💻 cs.HC

Personality Pairing Improves Human-AI Collaboration

Pith reviewed 2026-05-17 20:07 UTC · model grok-4.3

classification 💻 cs.HC
keywords personality pairing · human-AI collaboration · Big Five personality · randomized experiment · ad quality · click-through rates · teamwork dynamics

The pith

Specific personality pairings can improve human-AI collaboration performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether matching or differing human and AI personalities changes how effectively they complete joint creative work. Researchers randomly assigned 1,258 people to work with AI agents prompted to show different levels of Big Five traits, then had the resulting ads rated by 1,995 independent evaluators and run in a real advertising campaign on X that generated nearly 5 million impressions. Certain pairings produced measurably different ad quality, and some, such as neurotic humans paired with neurotic AI, drove higher click-through rates even after ad quality was accounted for. If the findings hold, future AI tools could be tuned to a user's personality to raise output quality and reduce friction in everyday teamwork tasks.

Core claim

In a preregistered randomized experiment, 1,258 participants paired with AI agents prompted for varying Big Five personality traits produced 7,266 ads whose quality was shaped by both the human's and the AI's traits. Specific pairings directly affected outcomes, with extraverted humans paired with conscientious AI yielding the lowest quality, and neurotic humans paired with neurotic AI achieving higher click-through rates in a field experiment with nearly 5 million impressions, even after controlling for ad quality. These results provide the first large-scale causal experimental evidence that specific personality pairings can improve human-AI collaboration.

What carries the argument

Randomized pairing of human Big Five personality traits with AI agents prompted to exhibit targeted levels of the same traits, with outcomes measured by independent ad quality ratings and real-world performance metrics.
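To make the idea of a pairing effect concrete, here is a minimal sketch of the interaction contrast in a 2x2 design (low or high human trait crossed with low or high AI trait). It is an illustration only: the function name `interaction_contrast`, the cell layout, and the scores are made up, and the paper's own analysis reports regression interaction terms whose exact specification is not given on this page.

```python
from statistics import mean

def interaction_contrast(cells):
    """Difference-in-differences for a 2x2 design.

    cells maps (human_high, ai_high) -> list of outcome scores,
    where each flag is 0 (low trait) or 1 (high trait).
    """
    m = {key: mean(scores) for key, scores in cells.items()}
    # Effect of a high-trait AI for high-trait humans,
    # minus the same effect for low-trait humans.
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])

# Hypothetical quality ratings, NOT data from the paper.
toy_cells = {
    (0, 0): [5.0, 5.2, 4.8],
    (0, 1): [5.5, 5.6, 5.4],
    (1, 0): [5.1, 4.9, 5.0],
    (1, 1): [4.0, 4.2, 3.8],
}
contrast = interaction_contrast(toy_cells)
print(f"interaction contrast: {contrast:.2f}")
```

In a saturated 2x2 linear model, the coefficient on the product term equals this difference-in-differences, which is why a nonzero value is read as a pairing effect rather than as a main effect of either personality.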

If this is right

  • Human personalities and AI personalities each shape ad quality and teamwork on their own.
  • Certain human-AI personality combinations produce measurable shifts in output quality.
  • Ad quality generated under different pairings influences real-world metrics such as click-through rates.
  • Some pairings improve performance metrics beyond the effect of quality alone.
  • These patterns support further work on personalizing AI agents to individual users.
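One common way to read "beyond the effect of quality alone" is a regression that includes both a pairing indicator and the ad's quality rating, so that the pairing coefficient is estimated with quality held fixed. The sketch below assumes a plain linear model fitted by least squares; the paper's actual controls are not stated on this page, and the helper names (`solve`, `ols`), the variable names, and the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical toy data, NOT from the paper. The outcome is built without
# noise as 1.0 + 0.5 * pair + 0.2 * quality, so the fit should recover
# those coefficients.
pair = [0, 0, 0, 1, 1, 1]                 # 1 = ad came from the pairing of interest
quality = [3.0, 4.0, 5.0, 3.5, 4.5, 6.0]  # mean rater score per ad
ctr_pct = [1.0 + 0.5 * p + 0.2 * q for p, q in zip(pair, quality)]

X = [[1.0, p, q] for p, q in zip(pair, quality)]  # intercept, pairing, quality
b0, b_pair, b_quality = ols(X, ctr_pct)
print(f"pairing effect holding quality fixed: {b_pair:.3f}")
```

A pairing coefficient that stays away from zero once quality is in the model is what a claim of this kind amounts to. Click-through data are counts of clicks over impressions, so a real analysis might prefer a binomial or Poisson model over this linear sketch.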

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairing logic could be tested in non-advertising tasks such as writing or data analysis.
  • AI interfaces might eventually detect a user's personality traits automatically to select a suitable agent profile.
  • Workplace teams that use AI assistants could see gains if agents are assigned according to employee traits.
  • Longer-term studies could check whether the effects persist across repeated collaborations.

Load-bearing premise

Humans perceived and responded to the AI agents' prompted personality traits as intended, and ad quality ratings accurately captured collaboration effectiveness.

What would settle it

A replication experiment in which participants show no difference in perceived AI personality or in final ad quality and click rates across the same set of prompted trait levels.
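The perception half of such a replication could be analyzed with a simple manipulation check: compare participants' ratings of the AI's trait between the low-prompt and high-prompt conditions. The sketch below uses Welch's t statistic as one reasonable choice; the rating scale, the variable names, and the numbers are hypothetical, and no such ratings are reported on this page.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical 1-7 ratings of "how extraverted did the AI seem?", NOT real data.
perceived_high_prompt = [6, 7, 6, 5, 6]
perceived_low_prompt = [3, 4, 2, 3, 3]

t_stat = welch_t(perceived_high_prompt, perceived_low_prompt)
print(f"Welch t = {t_stat:.2f}")
```

A t statistic near zero across all prompted traits would indicate that participants did not perceive the intended differences, which is the scenario that would undercut the pairing interpretation.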

Figures

Figures reproduced from arXiv: 2511.13979 by Harang Ju, Sinan Aral.

Figure 1: The experimental design for investigating human-AI collaboration. (A) Participants are randomized into collaborating with either a human partner or an AI agent. (B) AI agents are randomly assigned prompts to induce low or high levels of the Big Five personality traits. (C) Pairs collaborate in a real-time collaborative workspace to create display ads. Importantly, we approach personality prompts as neither…
Figure 2: Regression coefficients for human-AI personality interaction terms for teamwork quality
Figure 3: Regression coefficients for human-AI personality interaction terms on ad text quality, …
Figure 4: Regression coefficients for interaction terms of participants’ country of birth and AI
Figure 5: Regression coefficients for human-AI personality interaction terms and ad quality ratings
Figure 6: Regression coefficients for human-AI personality interaction terms for productivity
Figure 7: The Pairit platform. The task workspace is on the left, and the messaging interface on the right. The participant chats with an AI agent which has full context of the user interface and can modify all elements on the task panel. To enable the AI agents to act as effective collaborators on the Pairit platform, we provided them with full context of the task and collaboration by prompting each LLM call with a…
Original abstract

Here we examine how AI agent "personalities" interact with human personalities to shape human-AI collaboration and performance. In a large-scale, preregistered randomized experiment, we paired 1,258 participants with AI agents prompted to exhibit varying levels of the Big Five personality traits. These human-AI teams produced 7,266 display ads for a real think tank, which we evaluated using 1,995 independent human raters and a field experiment on X that generated nearly 5 million impressions. We found that human and AI personalities individually shaped ad quality and teamwork. When examined together, human-AI personality pairings directly affected ad quality outcomes. For example, extraverted humans paired with conscientious AI produced the lowest-quality ads, followed by conscientious humans paired with agreeable AI and neurotic humans paired with conscientious AI. In the field experiment, ad quality significantly influenced ad performance, measured by click-through rates and cost-per-click, and neurotic humans paired with neurotic AI achieved higher click-through rates, even after controlling for ad quality. Together, these results provide the first large-scale causal experimental evidence that specific personality pairings can improve human-AI collaboration and motivate future research on the implications of AI personalization for performance and teamwork dynamics in human-AI teams.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that specific pairings of human and AI Big-Five personality traits causally affect collaboration outcomes. In a preregistered randomized experiment, 1,258 participants were paired with AI agents prompted to exhibit varying Big-Five traits; the resulting 7,266 ads were rated by 1,995 independent raters for quality and tested in a field experiment on X yielding nearly 5 million impressions. Key findings include lower ad quality for extraverted humans paired with conscientious AI and higher click-through rates for neurotic human-neurotic AI pairs even after controlling for ad quality.

Significance. If the results hold, the work supplies the first large-scale causal evidence that personality pairings shape human-AI collaboration performance. Strengths include the preregistered randomized design, large N, multiple independent raters, and a real-world field experiment with external metrics (click-through rates, cost-per-click) that are independent of the personality prompts.

major comments (1)
  1. [Methods] Methods: No manipulation check is reported to verify that participants perceived or responded to the prompted AI personality traits as intended. Without such verification, the observed pairing effects (e.g., extraverted humans + conscientious AI yielding lowest-quality ads) cannot be confidently attributed to personality matching rather than prompt wording, model stochasticity, or task artifacts.
minor comments (2)
  1. [Results] Results: Report inter-rater reliability (e.g., Cronbach’s alpha or ICC) for the 1,995 ad-quality raters to support the claim that ratings validly capture collaboration effectiveness.
  2. [Results] Abstract and Results: Clarify the exact statistical controls used when reporting that neurotic pairings achieve higher click-through rates “even after controlling for ad quality.”
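For the reliability statistic requested in the first minor comment, a minimal sketch of Cronbach's alpha follows. It assumes a complete ads-by-raters matrix in which every rater scores every ad; with 1,995 raters spread over 7,266 ads that is unlikely to hold in the actual study, where an ICC suited to unbalanced designs would probably be the better fit. The function name and the ratings are hypothetical.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha for a complete matrix of scores.

    ratings: list of rows, one per ad; each row holds the k raters' scores.
    """
    k = len(ratings[0])
    # Variance of each rater's scores across ads.
    rater_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the per-ad total score.
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)

# Hypothetical ratings, NOT data from the paper.
consistent = [[1, 2, 1], [2, 3, 2], [3, 4, 3], [4, 5, 4]]  # raters agree on ordering
noisy = [[1, 2, 3], [2, 1, 3], [3, 4, 2], [4, 5, 4]]       # raters partly disagree

alpha_consistent = cronbach_alpha(consistent)
alpha_noisy = cronbach_alpha(noisy)
print(f"alpha (consistent raters): {alpha_consistent:.3f}")
print(f"alpha (noisy raters): {alpha_noisy:.3f}")
```

Alpha approaches 1 when raters rank the ads the same way and falls as their scores diverge, so a low value would weaken the claim that the ratings capture a shared notion of ad quality.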

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the strengths of our preregistered design, large sample, and field experiment. We address the single major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Methods] Methods: No manipulation check is reported to verify that participants perceived or responded to the prompted AI personality traits as intended. Without such verification, the observed pairing effects (e.g., extraverted humans + conscientious AI yielding lowest-quality ads) cannot be confidently attributed to personality matching rather than prompt wording, model stochasticity, or task artifacts.

    Authors: We agree that confirming participants' perceptions of the AI traits would strengthen causal attribution to personality matching specifically. Our experiment randomizes assignment to AI prompt conditions that were constructed from validated Big-Five trait descriptions, and the primary outcomes (independent ad-quality ratings by 1,995 raters and field click-through rates from nearly 5 million impressions) are measured independently of any participant self-report. These downstream metrics therefore reflect the actual content generated under each prompt. In the revision we will (1) append the exact prompt templates for each personality condition, (2) add a post-hoc linguistic analysis of the 7,266 generated ads using established Big-Five dictionaries to document trait-consistent language, and (3) explicitly note the absence of a participant-level manipulation check as a limitation while emphasizing the objective, externally validated performance measures. Because the data have already been collected, we cannot add a new real-time manipulation check, but the planned additions will improve transparency and help isolate the contribution of the personality prompts. revision: partial
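The dictionary-based linguistic analysis proposed in point (2) could look roughly like the sketch below: count how often words associated with a trait appear in each ad. The word lists here are a made-up toy lexicon used purely for illustration, not an established Big-Five dictionary, and `trait_rates` is a hypothetical helper name.

```python
import re

# Toy lexicon for illustration only; NOT a validated Big-Five dictionary.
TOY_LEXICON = {
    "extraversion": {"exciting", "join", "together", "celebrate"},
    "conscientiousness": {"plan", "careful", "reliable", "detailed"},
}

def trait_rates(text, lexicon=TOY_LEXICON):
    """Share of tokens in `text` that appear in each trait's word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {trait: 0.0 for trait in lexicon}
    return {
        trait: sum(token in words for token in tokens) / len(tokens)
        for trait, words in lexicon.items()
    }

rates = trait_rates("Join us for an exciting evening")
print(rates)
```

Comparing these rates between ads produced under low and high prompts for the same trait would show whether the prompts left trait-consistent traces in the text. It would not show that participants perceived those traits, which is the gap the referee raises.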

Circularity Check

0 steps flagged

Purely empirical randomized experiment with independent external metrics

full rationale

The paper describes a preregistered randomized experiment that pairs participants with AI agents prompted for Big-Five traits, then measures ad quality via independent human raters and field click-through rates on X. No equations, fitted parameters, or derivations are present. Central claims rest on outcomes that are measured independently of the input prompts, with randomization breaking any link between prompt assignment and participant characteristics. No self-citations are invoked as uniqueness theorems or load-bearing premises. The design is self-contained against external benchmarks (rater evaluations and live ad performance), satisfying the criteria for a non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard experimental assumptions rather than new mathematical entities. No free parameters are fitted to produce the result; personality levels are chosen by design. No invented entities are introduced.

axioms (2)
  • [standard math] Random assignment to personality conditions produces comparable groups except for the treatment.
    Invoked by the randomized experiment design described in the abstract.
  • [domain assumption] Prompted Big Five traits in the AI are perceived by humans as intended.
    Required for interpreting the pairing effects; not independently verified in the abstract.
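The first axiom can be probed empirically with a balance check. The sketch below assumes complete randomization (shuffling a balanced list of condition labels) and uses the standardized mean difference on a pre-treatment covariate; the paper's actual assignment procedure is not described on this page, and the function names, condition labels, and simulated covariate are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance

def assign(n, conditions, seed=0):
    """Complete randomization: shuffle a balanced list of condition labels."""
    rng = random.Random(seed)
    labels = [conditions[i % len(conditions)] for i in range(n)]
    rng.shuffle(labels)
    return labels

def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd

# Simulated pre-treatment covariate (for example, a human trait score); NOT real data.
rng = random.Random(1)
covariate = [rng.gauss(0.0, 1.0) for _ in range(100)]
labels = assign(len(covariate), ["low", "high"])

low = [x for x, label in zip(covariate, labels) if label == "low"]
high = [x for x, label in zip(covariate, labels) if label == "high"]
smd = standardized_mean_difference(high, low)
print(f"standardized mean difference: {smd:.3f}")
```

Values close to zero across pre-treatment covariates are what "comparable groups" means in practice. Randomization guarantees this only in expectation, so reporting such a table is a useful complement to the design.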

pith-pipeline@v0.9.0 · 5512 in / 1360 out tokens · 31794 ms · 2026-05-17T20:07:16.512031+00:00 · methodology


