arxiv: 2605.18302 · v1 · pith:ISMOV7G5new · submitted 2026-05-18 · 💻 cs.HC

What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience

Pith reviewed 2026-05-20 08:47 UTC · model grok-4.3

classification 💻 cs.HC
keywords synthetic participants · GPT · first-click tests · user experience · behavioral misalignment · LLM limitations · UX research

The pith

GPT fails to match human clicking patterns in more than half of the tasks drawn from real UX first-click tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether GPT can serve as a reliable synthetic participant in user experience research by comparing its click predictions and reasoning to actual human data. Across twelve first-click tests involving over three thousand real users, the AI produces significantly different click distributions in 53 percent of cases and misses key cognitive aspects of user decisions. Even enhancements like detailed personas or step-by-step reasoning prompts do not fix the core mismatches, only making outputs seem more plausible. The authors link these problems to the fundamental way large language models process information from text rather than real behavioral experiences. This matters because relying on such synthetic data could lead to misguided product design choices in industry settings.

Core claim

Within twelve diverse first-click tests obtained from real UX practice, GPT demonstrates critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes, with significantly different distributions from real data in 53% of tasks. Participant personas, chain-of-thought reasoning, and different sampling parameters fail to create sensible fidelity improvements apart from inflating believability. The observed distortions in synthetic responses reduce their overall analytical usefulness as a decision-making resource compared with real data, which can be linked to the statistical nature of LLMs and their encoding of semantic heuristics from training on linguistic data.

What carries the argument

Comparison of GPT-generated first-click predictions and explanations against human participant data from twelve real-world UX tests, using statistical distribution differences to measure misalignment.
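As a rough illustration of what a "distribution difference" between human and synthetic first clicks looks like, the sketch below computes click shares per interface element and the total variation distance between them. This is a descriptive measure swapped in for illustration, not the significance test the paper reports; the element names, click logs, and function names are all hypothetical.

```python
from collections import Counter


def click_shares(clicks, elements):
    """Return the share of first clicks that landed on each element."""
    counts = Counter(clicks)
    total = len(clicks)
    return {e: counts.get(e, 0) / total for e in elements}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same keys."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


# Hypothetical data: labels and click logs are invented for this example.
elements = ["menu", "search", "banner", "footer"]
human_clicks = ["menu"] * 40 + ["search"] * 35 + ["banner"] * 15 + ["footer"] * 10
synthetic_clicks = ["menu"] * 10 + ["search"] * 85 + ["banner"] * 5

human = click_shares(human_clicks, elements)
synthetic = click_shares(synthetic_clicks, elements)
tvd = total_variation_distance(human, synthetic)
print(f"TVD = {tvd:.2f}")
```

A distance of 0 would mean identical click shares and 1 would mean no overlap at all. In this made-up case the synthetic clicks pile onto one semantically obvious element while the human clicks spread out, which is the kind of pattern the review describes.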

If this is right

  • Synthetic data from GPT introduces multiple nuanced distortions that compromise the validity of UX research findings.
  • Prompting techniques such as personas or chain-of-thought reasoning do not produce meaningful improvements in behavioral fidelity.
  • The statistical and heuristic-driven properties of LLMs inherently restrict their capacity to simulate actual user interactions.
  • Real human participant data remains necessary for trustworthy decision-making in user experience design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI simulation of human behavior in research may need training on actual interaction logs rather than text corpora alone to reach usable fidelity.
  • The same limitations could affect LLM use in any domain that models human decision processes, such as market research or policy testing.
  • Teams adopting synthetic UX testing should add targeted real-user checks to catch distortions before they influence product choices.

Load-bearing premise

The twelve first-click tests drawn from real UX practice are sufficiently diverse and representative to support general claims about GPT's misalignment with human behavior across user experience research.

What would settle it

Repeating the study on a new collection of first-click tests or other UX methods and finding that GPT matches real-user click distributions and reasoning in the majority of cases would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18302 by Eduard Kuric, Matus Krajcovic, Peter Demcak.

Figure 1
Simulation procedure flowchart. Instructions are meta-directives applied across studies, content represents study-specific instructions and stimuli from the original source, output defines format.
We conducted the explorative qualitative portion of our analysis with the aim of explaining the identified patterns, as well as studying the additional properties of synthetic data that the quantitative perspecti…

Figure 2
Comparison of first click behavioral measures between simulation conditions, evaluated using data from all studies.
… whether the victor was actually the correct solution based on design intent or participant solutions. Since GPT is a blackbox, the method for selecting the victor is unknown, although we found that they were all elements with some semantic connection to the task. GPT did not apply this strate…

Figure 3
Line chart of Likert scale response distributions plotting all tasks to compare synthetic data (left) and real data (right). Statistically, score 7 synthetic SD = 8.68%, real SD = 22.58%; score 1 synthetic SD = 0.47%, real SD = 7.98%.
Open-ended follow-ups typically focused on justifications of clicks and rating choices, impressions and…

Figure 4
Linguistic similarity between synthetic responses to open-ended questions and real data across the evaluated conditions.

Figure 5
Boxplot comparison between synthetic and real responses to open-ended follow-up questions aggregated across FCT studies.
Our reflexive thematic analysis revealed several themes that capture distortions [45]:
4.2.1 Superficiality and homogeneity
The deficient internal depth of GPT responses is an antecedent to the external lack of variability between them. The contents of GPT responses typically attributed …
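The Figure 3 caption reports standard deviations for individual Likert scores (for example, score 7: synthetic SD = 8.68%, real SD = 22.58%). One plausible reading, stated here as an assumption, is that each SD describes how much the share of responses at a given score varies from task to task. The sketch below computes that quantity on invented data; the function name, the use of the sample (rather than population) SD, and all numbers are hypothetical.

```python
from statistics import stdev


def score_share_sd(task_distributions, score):
    """Sample SD, across tasks, of the percentage of responses giving `score`.

    Each task distribution maps a Likert score to a response count.
    """
    shares = [
        100.0 * dist.get(score, 0) / sum(dist.values())
        for dist in task_distributions
    ]
    return stdev(shares)


# Hypothetical per-task response counts (not data from the paper).
synthetic_tasks = [
    {5: 10, 6: 50, 7: 40},
    {5: 15, 6: 50, 7: 35},
    {5: 5, 6: 50, 7: 45},
]
real_tasks = [
    {1: 20, 4: 30, 7: 50},
    {1: 5, 4: 15, 7: 80},
    {1: 30, 4: 50, 7: 20},
]

synthetic_sd = score_share_sd(synthetic_tasks, 7)
real_sd = score_share_sd(real_tasks, 7)
print(f"score 7: synthetic SD = {synthetic_sd:.2f}%, real SD = {real_sd:.2f}%")
```

Under this reading, a low synthetic SD means the model gives nearly the same rating profile regardless of the task, while real participants' ratings shift substantially between tasks.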
Original abstract

Synthetic participants represent a methodologically concerning concept that threatens the integrity of UX research. Findings from previous experiments specify how AI outputs are misaligned with the behaviors and thoughts of real humans in various ways. However, industry voices keep underestimating their severity, advocating for practical compromises where good-enough data does not need to be perfect, and all issues will be solved by future tuning. Our study tackles the lack of systematic understanding of the practical issues that arise with synthetic behavior and its use for steering decisions within real contexts. Within twelve diverse first click tests (n = 3431) obtained from real UX practice, we examine the ability of GPT to predict where humans click and how they reason about their behavior. Results (e.g., significantly different distribution from real data in 53% of tasks) demonstrate critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes. Participant personas, chain-of-thought reasoning in GPT, and different sampling parameters fail to create sensible fidelity improvements apart from inflating believability. We expose a multitude of nuanced distortions in synthetic responses that reduce their overall analytical usefulness as a decision-making resource, compared with real data. Observed distortions can be theoretically linked to the properties categorically inherent to LLMs: their statistical nature and encoding of semantic heuristics dependent on their training on linguistic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that GPT exhibits critical misalignment with human click behavior and reasoning in first-click UX tests. Across twelve tests drawn from real practice (total n=3431 real users), GPT produces significantly different click distributions in 53% of tasks; common mitigations such as participant personas and chain-of-thought prompting yield only marginal or illusory improvements in fidelity. The authors attribute these distortions to inherent LLM properties (statistical next-token prediction and training on linguistic data) and conclude that synthetic participants therefore threaten the integrity of UX research and decision-making.

Significance. If the empirical comparison holds, the work is significant for HCI because it supplies a sizable, multi-task, real-user baseline that quantifies practical distortions in click patterns and cognitive rationales. The large real-user sample (n=3431) and use of tasks obtained from actual UX practice are clear strengths that move the discussion beyond small-scale or purely synthetic evaluations. The study also supplies falsifiable, task-level predictions that can be checked against future human data.

major comments (2)
  1. [Methods / Study Design] The selection and diversity of the twelve first-click tests are described only as 'obtained from real UX practice' and 'diverse' with no enumeration of domains, interface types, user goals, sampling frame, or quantitative diversity metrics. This assumption is load-bearing for the headline claim that misalignment occurs in 53% of tasks and that synthetic data therefore has 'critical failures' for UX decision-making; without documented representativeness, the observed rate could be task-specific rather than diagnostic of LLM properties.
  2. [Methods / Experimental Procedure] Exact prompting templates, temperature/sampling parameters, statistical thresholds for declaring 'significantly different distribution,' and participant exclusion criteria are not reported in sufficient detail. These omissions directly affect verifiability of the 53% figure and the conclusion that personas and chain-of-thought produce no sensible fidelity gains.
minor comments (2)
  1. [Results] A table or figure that lists per-task click-distribution statistics (e.g., chi-square values, p-values, effect sizes) alongside the real vs. synthetic comparison would improve readability of the central result.
  2. [Abstract / Introduction] The abstract and introduction could more explicitly name the GPT model version and release date used, as these details matter for reproducibility in a rapidly changing field.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. Their comments on methodological transparency are well-taken and have prompted us to strengthen the paper. We respond to each major comment below and indicate the changes we will incorporate in the revised version.

  1. Referee: [Methods / Study Design] The selection and diversity of the twelve first-click tests are described only as 'obtained from real UX practice' and 'diverse' with no enumeration of domains, interface types, user goals, sampling frame, or quantitative diversity metrics. This assumption is load-bearing for the headline claim that misalignment occurs in 53% of tasks and that synthetic data therefore has 'critical failures' for UX decision-making; without documented representativeness, the observed rate could be task-specific rather than diagnostic of LLM properties.

    Authors: We appreciate the referee's emphasis on documenting task selection to support generalizability claims. The twelve tasks were drawn directly from professional UX testing archives to capture realistic first-click scenarios across common interface contexts. To address the concern, the revised manuscript will include an expanded Methods subsection with a summary table listing each task's domain (e.g., e-commerce checkout, news article navigation, SaaS dashboard), interface type (desktop web or mobile), primary user goal, and approximate complexity indicators such as number of visible elements. This addition will allow readers to evaluate representativeness directly and will clarify that the 53% rate reflects patterns observed across a range of practical tasks rather than a narrow subset. We maintain that the core finding of misalignment remains diagnostic of LLM properties, but the added documentation will reduce any ambiguity about task specificity. revision: yes

  2. Referee: [Methods / Experimental Procedure] Exact prompting templates, temperature/sampling parameters, statistical thresholds for declaring 'significantly different distribution,' and participant exclusion criteria are not reported in sufficient detail. These omissions directly affect verifiability of the 53% figure and the conclusion that personas and chain-of-thought produce no sensible fidelity gains.

    Authors: We agree that precise procedural details are necessary for reproducibility and for readers to assess the robustness of the 53% figure and the prompting results. The original submission provided a high-level description; the revised manuscript will expand the Experimental Procedure section to include the verbatim base prompt template, the exact persona and chain-of-thought variants, the temperature and sampling settings used (temperature = 1.0, top_p = 1.0 for the primary runs), the statistical test and threshold applied (chi-square goodness-of-fit tests with p < 0.05, Bonferroni-adjusted for the 12 tasks), and the real-user exclusion criteria (incomplete sessions, failed attention checks, or response times below a minimum threshold). These details will be presented in the main text with references to supplementary materials containing the full prompt strings. This revision will directly support verification of both the misalignment rates and the limited efficacy of the tested mitigations. revision: yes
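The procedure named in the simulated rebuttal (chi-square goodness-of-fit tests at p < 0.05 with a Bonferroni adjustment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the click counts are invented, the function names are hypothetical, and the chi-square tail probability is hand-rolled only to keep the example within the standard library (in practice a statistics package such as SciPy's `scipy.stats.chisquare` would normally be used).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable.

    Uses the power series for the regularized lower incomplete gamma
    function; adequate for a sketch, not tuned for extreme arguments.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0
    for n in range(1, 10000):
        term *= half_x / (a + n)
        total += term
        if term < 1e-15 * total:
            break
    lower = math.exp(-half_x + a * math.log(half_x) - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def goodness_of_fit(synthetic_counts, human_counts):
    """Chi-square statistic and p-value of synthetic counts against human shares.

    Assumes every element received at least one human click.
    """
    n_syn, n_hum = sum(synthetic_counts), sum(human_counts)
    expected = [n_syn * h / n_hum for h in human_counts]
    stat = sum((o - e) ** 2 / e for o, e in zip(synthetic_counts, expected))
    return stat, chi2_sf(stat, len(human_counts) - 1)


def bonferroni_flags(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of comparisons."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical (human_counts, synthetic_counts) per task, one count per element.
tasks = [
    ([40, 35, 15, 10], [10, 85, 5, 0]),
    ([50, 30, 20], [48, 33, 19]),
    ([60, 40], [70, 30]),
]
p_values = [goodness_of_fit(syn, hum)[1] for hum, syn in tasks]
flags = bonferroni_flags(p_values)
print(f"{sum(flags)} of {len(flags)} tasks differ after Bonferroni adjustment")
```

In this invented example the third task falls below the unadjusted 0.05 level but not below the adjusted threshold of 0.05 / 3, which shows why the choice of correction affects any headline percentage of "significantly different" tasks.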

Circularity Check

0 steps flagged

No significant circularity; empirical comparison to external human data.

The paper's central claims rest on direct empirical comparisons between GPT outputs and real human click distributions collected from twelve first-click tests obtained from UX practice. No load-bearing steps reduce by construction to fitted parameters, self-definitions, or prior author citations; the 53% misalignment rate and related distortions are computed against independent external benchmarks rather than derived from the paper's own inputs. The analysis of personas, chain-of-thought, and sampling parameters is likewise evaluated against the same external human data, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and draws on standard statistical comparison methods plus the background assumption that real-user click data constitutes ground truth for UX behavior.

axioms (1)
  • [standard math] Statistical significance tests can reliably identify differences in click distributions between human and AI responses.
    Used to support the claim of significant differences in 53% of tasks.

pith-pipeline@v0.9.0 · 5781 in / 1131 out tokens · 50741 ms · 2026-05-20T08:47:48.917422+00:00 · methodology

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 1 internal anchor

  1. [1] Olivia Guest and Iris van Rooij. Critical artificial intelligence literacy for psychologists, Oct 2025. URL https://doi.org/10.31234/osf.io/dkrgj_v1

  2. [2] Tanisha Jowsey, Virginia Braun, Victoria Clarke, Deborah Lupton, and Michelle Fine. We reject the use of generative artificial intelligence for reflexive qualitative research. Qualitative Inquiry, 0(0):10778004251401851, 2025. URL https://doi.org/10.1177/10778004251401851

  3. [3] Nathan E. Sanders, Alex Ulinich, and Bruce Schneier. Demonstrations of the Potential of AI-based Political Issue Polling. Harvard Data Science Review, 5(4):23, Oct 27 2023. URL https://doi.org/10.1162/99608f92.1d3cf75d

  4. [4] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025. URL https://arxiv.org/abs/2506.06941

  5. [5] Yuan Gao, Dokyun Lee, Gordon Burtch, and Sina Fazelpour. Take caution in using llms as human surrogates. Proceedings of the National Academy of Sciences, 122(24):e2501660122, 2025. URL https://doi.org/10.1073/pnas.2501660122

  6. [6] Monika Imschloss, Marko Sarstedt, Susanne J. Adler, and Jun Hwa Cheah. Using llms in sensory service research: initial insights and perspectives. The Service Industries Journal, 0(0):1–22, 2025. URL https://doi.org/10.1080/02642069.2025.2479723

  7. [7] Marcel Binz and Eric Schulz. Using cognitive psychology to understand gpt-3. Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023. URL https://doi.org/10.1073/pnas.2218523120

  8. [8] Grgur Kovač, Rémy Portelas, Masataka Sawayama, Peter Ford Dominey, and Pierre-Yves Oudeyer. Stick to your role! stability of personal values expressed in large language models. PLOS ONE, 19(8):1–20, 08

  9. [9] URL https://doi.org/10.1371/journal.pone.0309114

  10. [10] W. Craig Tomlin. UX and Usability Testing Data, chapter 7, pages 97–127. Apress, Berkeley, CA, 2018. ISBN 978-1-4842-3867-7. URL https://doi.org/10.1007/978-1-4842-3867-7_7

  11. [11] Matus Krajcovic, Peter Demcak, and Eduard Kuric. Is usability testing valid with prototypes where clickable hotspots are highlighted upon misclick? Journal of Systems and Software, 226:112446, 2025. ISSN 0164-1212. URL https://doi.org/10.1016/j.jss.2025.112446

  12. [12] Iskander Sanchez-Rola, Davide Balzarotti, Christopher Kruegel, Giovanni Vigna, and Igor Santos. Dirty clicks: A study of the usability and security implications of click-related behaviors on the web. In Proceedings of The Web Conference 2020, WWW ’20, pages 395–406, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450370233. URL https:...

  13. [13] Chris Barry and Mark Lardner. A study of first click behaviour and user interaction on the google serp. In Jaroslav Pokorny, Vaclav Repa, Karel Richta, Wita Wojtkowski, Henry Linger, Chris Barry, and Michael Lang, editors, Information Systems Development, pages 89–99, New York, NY, 2011. Springer New York. ISBN 978-1-4419-9790-6

  14. [14] Piotr Chynał, Julia Falkowska, and Janusz Sobecki. Web page graphic design usability testing enhanced with eye-tracking. In Waldemar Karwowski and Tareq Ahram, editors, Intelligent Human Systems Integration, pages 515–520, Cham, 2018. Springer International Publishing

  15. [15] Stepan Malkevich, Ilya Markov, Elena Michailova, and Maarten de Rijke. Evaluating and analyzing click simulation in web search. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ’17, pages 281–284, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450344906. URL https://doi.org/10.1145/312...

  16. [16] Seungwon Do, Minsuk Chang, and Byungjoo Lee. A simulation model of intermittently controlled point-and-click behaviour. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, pages 1–17, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380966. URL https://doi.org/10.1145/3411764.3445514

  17. [17] Aleksandr Chuklin, Ilya Markov, and Maarten De Rijke. Click models for web search. Springer Nature, 2022. ISBN 9783031022944

  18. [18] Eldon Schoop, Xin Zhou, Gang Li, Zhourong Chen, Bjoern Hartmann, and Yang Li. Predicting and explaining mobile ui tappability with vision modeling and saliency analysis. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI ’22, pages 1–21, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450391573. URL h...

  19. [19] Mikel Villamañe and Ainhoa Alvarez. Facilitating and automating usability testing of educational technologies. Computer Applications in Engineering Education, 32(3):e22725, 2024. URL https://doi.org/10.1002/cae.22725

  20. [20] Andrea Esposito, Giuseppe Desolda, Rosa Lanzilotti, and Maria Francesca Costabile. Serene: a web platform for the ux semi-automatic evaluation of website. In Proceedings of the 2022 International Conference on Advanced Visual Interfaces, AVI ’22, pages 1–3, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450397193. URL https://doi.org...

  21. [21] Jackson Feijó Filho, Thiago Valle, and Wilson Prata. Automated usability tests for mobile devices through live emotions logging. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, MobileHCI ’15, pages 636–643, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 978145033...

  22. [22] Mingming Fan, Ke Wu, Jian Zhao, Yue Li, Winter Wei, and Khai N. Truong. Vista: Integrating machine intelligence with visualization to support the investigation of think-aloud sessions. IEEE Transactions on Visualization and Computer Graphics, 26(1):343–352, 2020. URL https://doi.org/10.1109/TVCG.2019.2934797

  23. [23] Ehsan Jahangirzadeh Soure, Emily Kuang, Mingming Fan, and Jian Zhao. Coux: Collaborative visual analysis of think-aloud usability test videos for digital interfaces. IEEE Transactions on Visualization and Computer Graphics, 28(1):643–653, 2022. URL https://doi.org/10.1109/TVCG.2021.3114822

  24. [24] Andrea Batch, Yipeng Ji, Mingming Fan, Jian Zhao, and Niklas Elmqvist. uxsense: Supporting user experience analysis with visualization and computer vision. IEEE Transactions on Visualization and Computer Graphics, 30(7):3841–3856, 2024. URL https://doi.org/10.1109/TVCG.2023.3241581

  25. [25] Wei Xiang, Hanfei Zhu, Suqi Lou, Xinli Chen, Zhenghua Pan, Yuping Jin, Shi Chen, and Lingyun Sun. Simuser: Generating usability feedback by simulating various users interacting with mobile applications. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, pages 1–17, New York, NY, USA, 2024. Association for Computing Ma...

  26. [26] Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. Uxagent: An llm agent-based usability testing framework for web design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, pages 1–12, New York, NY, USA, 2025. Assoc...

  27. [27] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, pages 1–13, New York, NY, USA, 2024. Associati...

  28. [28] Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models. International Journal of Computer Vision, 133(2):825–843, Feb 2025. ISSN 1573-1405. URL https://doi.org/10.1007/s11263-024-02214-4

  29. [29] Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray, and Walid Maalej. Guing: A mobile gui search engine using a vision-language model. ACM Trans. Softw. Eng. Methodol., 34(4), April 2025. ISSN 1049-331X. URL https://doi.org/10.1145/3702993

  30. [30] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok,...

  31. [31] Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, and José Hernández-Orallo. Larger and more instructable language models become less reliable. Nature, 634(8032):61–68, 10 2024. URL https://doi.org/10.1038/s41586-024-07930-y

  32. [32] Yunting Liu, Shreya Bhandari, and Zachary A. Pardos. Leveraging llm respondents for item evaluation: A psychometric analysis. British Journal of Educational Technology, 56(3):1028–1052, 2025. URL https://doi.org/10.1111/bjet.13570

  33. [33] Marco Gerosa, Bianca Trinkenreich, Igor Steinmacher, and Anita Sarma. Can ai serve as a substitute for human subjects in software engineering research? Automated Software Engineering, 31(1):13, 1 2024. ISSN 1573-7535. URL https://doi.org/10.1007/s10515-023-00409-6

  34. [34] Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 3605–3627, Mexico City, Mexico, June 2024. Associa...

  35. [35] Salvatore Giorgi, Tingting Liu, Ankit Aich, Kelsey Jane Isman, Garrick Sherman, Zachary Fried, João Sedoc, Lyle Ungar, and Brenda Curtis. Modeling human subjectivity in LLMs using explicit and implicit human factors in personas. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2...

  36. [36] Association for Computational Linguistics. URL https://doi.org/10.18653/v1/2024.findings-emnlp.420

  37. [37] James Brand, Ayelet Israeli, and Donald Ngwe. Using llms for market research. Harvard business school marketing unit working paper, (23-062):60, 2023. URL http://dx.doi.org/10.2139/ssrn.4395751

  38. [38] Guilherme F.C.F. Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo. Exploring the psychology of llms’ moral and legal reasoning. Artificial Intelligence, 333:104145, 2024. ISSN 0004-3702. URL https://doi.org/10.1016/j.artint.2024.104145

  39. [39] Kevin Ma, Daniele Grandi, Christopher McComb, and Kosa Goucher-Lambert. Do large language models produce diverse design concepts? a comparative study with human-crowdsourced solutions. Journal of Computing and Information Science in Engineering, 25(2):024501, 2025. URL https://doi.org/10.1115/1.4067332

  40. [40] Aliya Amirova, Theodora Fteropoulli, Nafiso Ahmed, Martin R. Cowie, and Joel Z. Leibo. Framework-based qualitative analysis of free responses of large language models: Algorithmic fidelity. PLOS ONE, 19(3):1–33, 03 2024. doi: 10.1371/journal.pone.0300024. URL https://doi.org/10.1371/journal.pone.0300024

  41. [41]

    William Fleeson and Patrick Gallagher. The implications of big five standing for the distribution of trait manifestation in behavior: fifteen experience-sampling studies and a meta-analysis.J Pers Soc Psychol, 97 (6):1097–1114, December 2009

  42. [42]

    One size fits all? what counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology, 18(3):328–352, 2021

    Virginia Braun and Victoria Clarke. One size fits all? what counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology, 18(3):328–352, 2021. URL https://doi.org/10. 1080/14780887.2020.1769238

  43. [43]

    Virginia Braun and Victoria Clarke. Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2):77–101, 2006. URL https://doi.org/10.1191/1478088706qp063oa

  44. [44]

    Ida Lumintu. Content-based recommendation engine using term frequency-inverse document frequency vectorization and cosine similarity: A case study. In 2023 IEEE 9th Information Technology International Seminar (ITIS), pages 1–6, Batu Malang, Indonesia, 2023. URL https://doi.org/10.1109/ITIS59651.2023.10420137

  45. [45]

    Pranav Narayanan Venkit, Jiayi Li, Yingfan Zhou, Sarah Rajtmajer, and Shomir Wilson. A tale of two identities: An ethical audit of ai-crafted synthetic personas. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44):37765–37774, Mar. 2026. URL https://doi.org/10.1609/aaai.v40i44.41112

  46. [46]

    Jan Karem Höhne, Konstantin Gavras, and Joshua Claassen. Typing or speaking? comparing text and voice answers to open questions on sensitive topics in smartphone surveys. Social Science Computer Review, 42(4):1066–1085, 2024. URL https://doi.org/10.1177/08944393231160961

  47. [47]

    Gati V Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings o...

  48. [48]

    Kent Bach. Pragmatics and the Philosophy of Language, chapter 21, pages 463–487. John Wiley & Sons, Ltd,

  49. [49]

    ISBN 9780470756959. URL https://doi.org/10.1002/9780470756959.ch21

  50. [50]

    Dilyorjon Solidjonov. Pragmatic competence without embodiment? evaluating llm performance on implicature, presupposition, and speech acts. Journal of Cultural Cognitive Science, Apr 2026. ISSN 2520-1018. URL https://doi.org/10.1007/s41809-026-00200-5

  51. [51]

    Ryan Kennedy, Scott Clifford, Tyler Burleigh, Philip D. Waggoner, Ryan Jewell, and Nicholas J. G. Winter. The shape of and solutions to the mturk quality crisis. Political Science Research and Methods, 8(4):614–629,

  52. [52]

    URL https://doi.org/10.1017/psrm.2020.6

  53. [53]

    June Wang, Gabriela Calderon, Erin R. Hager, Lorece V. Edwards, Andrea A. Berry, Yisi Liu, Janny Dinh, August C. Summers, Katherine A. Connor, Megan E. Collins, Laura Prichett, Beth R. Marshall, and Sara B. Johnson. Identifying and preventing fraudulent responses in online public health surveys: Lessons learned during the covid-19 pandemic. PLOS Global Pu...

  54. [54]

    Prakash Shukla, Phuong Bui, Sean S Levy, Max Kowalski, Ali Baigelenov, and Paul Parsons. De-skilling, cognitive offloading, and misplaced responsibilities: Potential ironies of ai-assisted design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, pages 1–7, New York, NY, USA, 2025. Association...

  55. [55]

    Joon Sung Park, Carolyn Q. Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, and Michael S. Bernstein. Llm agents grounded in self-reports enable general-purpose simulation of individuals, 2026. URL https://arxiv.org/abs/2411.10109

  56. [56]

    Eduard Kuric, Peter Demcak, and Matus Krajcovic. Synthetic participants generated by large language models: A systematic literature review, 2026. URL https://doi.org/10.21203/rs.3.rs-9057643/v1

  57. [57]

    Eduard Kuric, Peter Demcak, and Matus Krajcovic. Card sorting simulator: Augmenting design of logical information architectures with large language models, 2025. URL https://arxiv.org/abs/2505.09478

  58. [58]

    Eduard Kuric, Peter Demcak, Matus Krajcovic, and Jan Lang. Systematic literature review of automation and artificial intelligence in usability issue detection, 2025. URL https://arxiv.org/abs/2504.01415

  59. [59]

    Yu Zhang. Epistemic limits of hallucination mitigation in large language models. Foundations of Science, Mar 2026. ISSN 1572-8471. URL https://doi.org/10.1007/s10699-026-10030-x

  60. [60]

    Matus Krajcovic, Peter Demcak, and Eduard Kuric. Talking surveys: How photorealistic embodied conversational agents shape response quality, engagement, and satisfaction, 2025. URL https://arxiv.org/abs/2508.02376

  61. [61]

    Eduard Kuric, Peter Demcak, Jozef Majzel, and Giang Nguyen. Democratizing eye-tracking? appearance-based gaze estimation with improved attention branch. Engineering Applications of Artificial Intelligence, 149:110494, 2025. ISSN 0952-1976. URL https://doi.org/10.1016/j.engappai.2025.110494

  62. [62]

    Eduard Kuric, Peter Demcak, and Matus Krajcovic. Card sorting with fewer cards and the same mental models? a reexamination of an established practice. International Journal of Human–Computer Interaction, 0(0):1–25, 2026. URL https://doi.org/10.1080/10447318.2025.2603633

  63. [63]

    Eduard Kuric, Peter Demcak, and Matus Krajcovic. Validation of information architecture: Cross-methodological comparison of tree testing variants and prototype user testing. Information and Software Technology, 183:107740, 2025. ISSN 0950-5849. URL https://doi.org/10.1016/j.infsof.2025.107740

A Extended statistical results

For extended statistical results...