What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience

Eduard Kuric; Matus Krajcovic; Peter Demcak

arxiv: 2605.18302 · v1 · pith:ISMOV7G5new · submitted 2026-05-18 · 💻 cs.HC

What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience

Eduard Kuric , Peter Demcak , Matus Krajcovic This is my paper

Pith reviewed 2026-05-20 08:47 UTC · model grok-4.3

classification 💻 cs.HC

keywords synthetic participantsGPTfirst-click testsuser experiencebehavioral misalignmentLLM limitationsUX research

0 comments

The pith

GPT fails to match human clicking patterns in more than half of real UX first-click tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether GPT can serve as a reliable synthetic participant in user experience research by comparing its click predictions and reasoning to actual human data. Across twelve first-click tests involving over three thousand real users, the AI produces significantly different click distributions in 53 percent of cases and misses key cognitive aspects of user decisions. Even enhancements like detailed personas or step-by-step reasoning prompts do not fix the core mismatches, only making outputs seem more plausible. The authors link these problems to the fundamental way large language models process information from text rather than real behavioral experiences. This matters because relying on such synthetic data could lead to misguided product design choices in industry settings.

Core claim

Within twelve diverse first click tests obtained from real UX practice, GPT demonstrates critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes, with significantly different distributions from real data in 53% of tasks. Participant personas, chain-of-thought reasoning, and different sampling parameters fail to create sensible fidelity improvements apart from inflating believability. The observed distortions in synthetic responses reduce their overall analytical usefulness as a decision-making resource compared with real data, which can be linked to the statistical nature of LLMs and their encoding of semantic heuristics from训练

What carries the argument

Comparison of GPT-generated first-click predictions and explanations against human participant data from twelve real-world UX tests, using statistical distribution differences to measure misalignment.

If this is right

Synthetic data from GPT introduces multiple nuanced distortions that compromise the validity of UX research findings.
Prompting techniques such as personas or chain-of-thought reasoning do not produce meaningful improvements in behavioral fidelity.
The statistical and heuristic-driven properties of LLMs inherently restrict their capacity to simulate actual user interactions.
Real human participant data remains necessary for trustworthy decision-making in user experience design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

AI simulation of human behavior in research may need training on actual interaction logs rather than text corpora alone to reach usable fidelity.
The same limitations could affect LLM use in any domain that models human decision processes, such as market research or policy testing.
Teams adopting synthetic UX testing should add targeted real-user checks to catch distortions before they influence product choices.

Load-bearing premise

The twelve first-click tests drawn from real UX practice are sufficiently diverse and representative to support general claims about GPT's misalignment with human behavior across user experience research.

What would settle it

Repeating the study on a new collection of first-click tests or other UX methods and finding that GPT matches real-user click distributions and reasoning in the majority of cases would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18302 by Eduard Kuric, Matus Krajcovic, Peter Demcak.

**Figure 1.** Figure 1: Simulation procedure flowchart. Instructions are meta-directives applied across studies, content represents study-specific instructions and stimuli from the original source, output defines format. We conducted the explorative qualitative portion of our analysis with the aim of explaining the identified patterns, as well as studying the additional properties of synthetic data that the quantitative perspecti… view at source ↗

**Figure 2.** Figure 2: Comparison of first click behavioral measures between simulation conditions, evaluated using data from all studies. whether the victor was actually the correct solution based on design intent or participant solutions. Since GPT is a blackbox, the method for selecting the victor is unknown, although we found that they were all elements with some semantic connection to the task. GPT did not apply this strate… view at source ↗

**Figure 3.** Figure 3: Line chart of Likert scale response distributions plotting all tasks to compare synthetic data (left) and real data (right). Statistically, score 7 synthetic SD = 8.68%, real SD = 22.58%; score 1 synthetic SD = 0.47%, real SD = 7.98%. Open-ended follow-ups typically focused on justifications of clicks and rating choices, impressions and 10 [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Linguistic similarity between synthetic responses to open-ended questions and real data across the evaluated conditions [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Boxplot comparison between synthetic and real responses to open-ended follow-up questions aggregated across FCT studies. Our reflexive thematic analysis revealed several themes that capture distortions [45]: 4.2.1 Superficiality and homogeneity The deficient internal depth of GPT responses is an antecedent to the external lack of variability between them. The contents of GPT responses typically attributed … view at source ↗

read the original abstract

Synthetic participants represent a methodologically concerning concept that threatens the integrity of UX research. Findings from previous experiments specify how AI outputs are misaligned with the behaviors and thoughts of real humans in various ways. However, industry voices keep underestimating their severity, advocating for practical compromises where good-enough data does not need to be perfect, and all issues will be solved by future tuning. Our study tackles the lack of systematic understanding of the practical issues that arise with synthetic behavior and its use for steering decisions within real contexts. Within twelve diverse first click tests (n = 3431) obtained from real UX practice, we examine the ability of GPT to predict where humans click and how they reason about their behavior. Results (e.g., significantly different distribution from real data in 53% of tasks) demonstrate critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes. Participant personas, chain-of-thought reasoning in GPT, and different sampling parameters fail to create sensible fidelity improvements apart from inflating believability. We expose a multitude of nuanced distortions in synthetic responses that reduce their overall analytical usefulness as a decision-making resource, compared with real data. Observed distortions can be theoretically linked to the properties categorically inherent to LLMs: their statistical nature and encoding of semantic heuristics dependent on their training on linguistic data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GPT click predictions diverge from real human data in over half the UX tasks tested, but the twelve tasks lack enough detail on selection to support broad claims about synthetic participants.

read the letter

The core finding is that GPT produces click distributions significantly different from real users in 53% of the twelve first-click tests, and standard tweaks like personas or chain-of-thought do not close the gap in any meaningful way. The authors tie this to basic LLM traits such as statistical next-token prediction rather than deeper cognitive modeling. That observation is not brand new, but running the comparison on actual UX tasks with 3431 real participants gives it a practical edge over purely theoretical critiques.

Referee Report

2 major / 2 minor

Summary. The paper claims that GPT exhibits critical misalignment with human click behavior and reasoning in first-click UX tests. Across 12 tasks drawn from real practice (total n=3431 real users), GPT produces significantly different click distributions in 53% of cases; common mitigations such as participant personas and chain-of-thought prompting yield only marginal or illusory improvements in fidelity. The authors attribute these distortions to inherent LLM properties (statistical next-token prediction and training on linguistic data) and conclude that synthetic participants therefore threaten the integrity of UX research and decision-making.

Significance. If the empirical comparison holds, the work is significant for HCI because it supplies a sizable, multi-task, real-user baseline that quantifies practical distortions in click patterns and cognitive rationales. The large real-user sample (n=3431) and use of tasks obtained from actual UX practice are clear strengths that move the discussion beyond small-scale or purely synthetic evaluations. The study also supplies falsifiable, task-level predictions that can be checked against future human data.

major comments (2)

[Methods / Study Design] The selection and diversity of the twelve first-click tests are described only as 'obtained from real UX practice' and 'diverse' with no enumeration of domains, interface types, user goals, sampling frame, or quantitative diversity metrics. This assumption is load-bearing for the headline claim that misalignment occurs in 53% of tasks and that synthetic data therefore has 'critical failures' for UX decision-making; without documented representativeness, the observed rate could be task-specific rather than diagnostic of LLM properties.
[Methods / Experimental Procedure] Exact prompting templates, temperature/sampling parameters, statistical thresholds for declaring 'significantly different distribution,' and participant exclusion criteria are not reported in sufficient detail. These omissions directly affect verifiability of the 53% figure and the conclusion that personas and chain-of-thought produce no sensible fidelity gains.

minor comments (2)

[Results] A table or figure that lists per-task click-distribution statistics (e.g., chi-square values, p-values, effect sizes) alongside the real vs. synthetic comparison would improve readability of the central result.
[Abstract / Introduction] The abstract and introduction could more explicitly name the GPT model version and release date used, as these details matter for reproducibility in a rapidly changing field.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. Their comments on methodological transparency are well-taken and have prompted us to strengthen the paper. We respond to each major comment below and indicate the changes we will incorporate in the revised version.

read point-by-point responses

Referee: [Methods / Study Design] The selection and diversity of the twelve first-click tests are described only as 'obtained from real UX practice' and 'diverse' with no enumeration of domains, interface types, user goals, sampling frame, or quantitative diversity metrics. This assumption is load-bearing for the headline claim that misalignment occurs in 53% of tasks and that synthetic data therefore has 'critical failures' for UX decision-making; without documented representativeness, the observed rate could be task-specific rather than diagnostic of LLM properties.

Authors: We appreciate the referee's emphasis on documenting task selection to support generalizability claims. The twelve tasks were drawn directly from professional UX testing archives to capture realistic first-click scenarios across common interface contexts. To address the concern, the revised manuscript will include an expanded Methods subsection with a summary table listing each task's domain (e.g., e-commerce checkout, news article navigation, SaaS dashboard), interface type (desktop web or mobile), primary user goal, and approximate complexity indicators such as number of visible elements. This addition will allow readers to evaluate representativeness directly and will clarify that the 53% rate reflects patterns observed across a range of practical tasks rather than a narrow subset. We maintain that the core finding of misalignment remains diagnostic of LLM properties, but the added documentation will reduce any ambiguity about task specificity. revision: yes
Referee: [Methods / Experimental Procedure] Exact prompting templates, temperature/sampling parameters, statistical thresholds for declaring 'significantly different distribution,' and participant exclusion criteria are not reported in sufficient detail. These omissions directly affect verifiability of the 53% figure and the conclusion that personas and chain-of-thought produce no sensible fidelity gains.

Authors: We agree that precise procedural details are necessary for reproducibility and for readers to assess the robustness of the 53% figure and the prompting results. The original submission provided a high-level description; the revised manuscript will expand the Experimental Procedure section to include the verbatim base prompt template, the exact persona and chain-of-thought variants, the temperature and sampling settings used (temperature = 1.0, top_p = 1.0 for the primary runs), the statistical test and threshold applied (chi-square goodness-of-fit tests with p < 0.05, Bonferroni-adjusted for the 12 tasks), and the real-user exclusion criteria (incomplete sessions, failed attention checks, or response times below a minimum threshold). These details will be presented in the main text with references to supplementary materials containing the full prompt strings. This revision will directly support verification of both the misalignment rates and the limited efficacy of the tested mitigations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical comparison to external human data.

full rationale

The paper's central claims rest on direct empirical comparisons between GPT outputs and real human click distributions collected from twelve first-click tests obtained from UX practice. No load-bearing steps reduce by construction to fitted parameters, self-definitions, or prior author citations; the 53% misalignment rate and related distortions are computed against independent external benchmarks rather than derived from the paper's own inputs. The analysis of personas, chain-of-thought, and sampling parameters is likewise evaluated against the same external human data, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and draws on standard statistical comparison methods plus the background assumption that real-user click data constitutes ground truth for UX behavior.

axioms (1)

standard math Statistical significance tests can reliably identify differences in click distributions between human and AI responses.
Used to support the claim of significant differences in 53% of tasks.

pith-pipeline@v0.9.0 · 5781 in / 1131 out tokens · 50741 ms · 2026-05-20T08:47:48.917422+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Results (e.g., significantly different distribution from real data in 53% of tasks) demonstrate critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Observed distortions can be theoretically linked to the properties categorically inherent to LLMs: their statistical nature and encoding of semantic heuristics dependent on their training on linguistic data.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 1 internal anchor

[1]

Critical artificial intelligence literacy for psychologists, Oct 2025

Olivia Guest and Iris van Rooij. Critical artificial intelligence literacy for psychologists, Oct 2025. URL https://doi.org/10.31234/osf.io/dkrgj_v1

work page doi:10.31234/osf.io/dkrgj_v1 2025
[2]

We reject the use of generative artificial intelligence for reflexive qualitative research.Qualitative Inquiry, 0(0): 10778004251401851, 2025

Tanisha Jowsey, Virginia Braun, Victoria Clarke, Deborah Lupton, and Michelle Fine. We reject the use of generative artificial intelligence for reflexive qualitative research.Qualitative Inquiry, 0(0): 10778004251401851, 2025. URLhttps://doi.org/10.1177/10778004251401851

work page doi:10.1177/10778004251401851 2025
[3]

Sanders, Alex Ulinich, and Bruce Schneier

Nathan E. Sanders, Alex Ulinich, and Bruce Schneier. Demonstrations of the Potential of AI-based Political Issue Polling.Harvard Data Science Review, 5(4):23, oct 27 2023. URL https://doi.org/10.1162/ 99608f92.1d3cf75d

work page 2023
[4]

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025. URLhttps://arxiv.org/abs/2506.06941

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Take caution in using llms as human surrogates.Proceedings of the National Academy of Sciences, 122(24):e2501660122, 2025

Yuan Gao, Dokyun Lee, Gordon Burtch, and Sina Fazelpour. Take caution in using llms as human surrogates.Proceedings of the National Academy of Sciences, 122(24):e2501660122, 2025. URL https: //doi.org/10.1073/pnas.2501660122. 19

work page doi:10.1073/pnas.2501660122 2025
[6]

Adler, and Jun Hwa Cheah

Monika Imschloss, Marko Sarstedt, Susanne J. Adler, and Jun Hwa Cheah. Using llms in sensory service research: initial insights and perspectives.The Service Industries Journal, 0(0):1–22, 2025. URL https://doi.org/10.1080/02642069.2025.2479723

work page doi:10.1080/02642069.2025.2479723 2025
[7]

Using cognitive psychology to understand gpt-3.Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023

Marcel Binz and Eric Schulz. Using cognitive psychology to understand gpt-3.Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023. URLhttps://doi.org/10.1073/pnas.2218523120

work page doi:10.1073/pnas.2218523120 2023
[8]

Stick to your role! stability of personal values expressed in large language models.PLOS ONE, 19(8):1–20, 08

Grgur Kovaˇ c, Rémy Portelas, Masataka Sawayama, Peter Ford Dominey, and Pierre-Yves Oudeyer. Stick to your role! stability of personal values expressed in large language models.PLOS ONE, 19(8):1–20, 08

work page
[9]

URLhttps://doi.org/10.1371/journal.pone.0309114

work page doi:10.1371/journal.pone.0309114
[10]

Craig Tomlin.UX and Usability Testing Data, chapter 7, pages 97–127

W. Craig Tomlin.UX and Usability Testing Data, chapter 7, pages 97–127. Apress, Berkeley, CA, 2018. ISBN 978-1-4842-3867-7. URLhttps://doi.org/10.1007/978-1-4842-3867-7_7

work page doi:10.1007/978-1-4842-3867-7_7 2018
[11]

Is usability testing valid with prototypes where clickable hotspots are highlighted upon misclick?Journal of Systems and Software, 226:112446, 2025

Matus Krajcovic, Peter Demcak, and Eduard Kuric. Is usability testing valid with prototypes where clickable hotspots are highlighted upon misclick?Journal of Systems and Software, 226:112446, 2025. ISSN 0164-1212. URLhttps://doi.org/10.1016/j.jss.2025.112446

work page doi:10.1016/j.jss.2025.112446 2025
[12]

Dirty clicks: A study of the usability and security implications of click-related behaviors on the web

Iskander Sanchez-Rola, Davide Balzarotti, Christopher Kruegel, Giovanni Vigna, and Igor Santos. Dirty clicks: A study of the usability and security implications of click-related behaviors on the web. In Proceedings of The Web Conference 2020, WWW ’20, page 395–406, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450370233. URLhttps:...

work page doi:10.1145/3366423.3380124 2020
[13]

A study of first click behaviour and user interaction on the google serp

Chris Barry and Mark Lardner. A study of first click behaviour and user interaction on the google serp. In Jaroslav Pokorny, Vaclav Repa, Karel Richta, Wita Wojtkowski, Henry Linger, Chris Barry, and Michael Lang, editors,Information Systems Development, pages 89–99, New York, NY, 2011. Springer New York. ISBN 978-1-4419-9790-6

work page 2011
[14]

Web page graphic design usability testing en- hanced with eye-tracking

Piotr Chynał, Julia Falkowska, and Janusz Sobecki. Web page graphic design usability testing en- hanced with eye-tracking. In Waldemar Karwowski and Tareq Ahram, editors,Intelligent Human Systems Integration, pages 515–520, Cham, 2018. Springer International Publishing

work page 2018
[15]

Evaluating and analyzing click simulation in web search

Stepan Malkevich, Ilya Markov, Elena Michailova, and Maarten de Rijke. Evaluating and analyzing click simulation in web search. InProceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ’17, page 281–284, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450344906. URLhttps://doi.org/10.1145/312...

work page doi:10.1145/3121050.3121096 2017
[16]

A simulation model of intermittently controlled point- and-click behaviour

Seungwon Do, Minsuk Chang, and Byungjoo Lee. A simulation model of intermittently controlled point- and-click behaviour. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, pages 1–17, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380966. URLhttps://doi.org/10.1145/3411764.3445514

work page doi:10.1145/3411764.3445514 2021
[17]

Springer Nature, 2022

Aleksandr Chuklin, Ilya Markov, and Maarten De Rijke.Click models for web search. Springer Nature, 2022. ISBN 9783031022944

work page 2022
[18]

Predicting and explaining mobile ui tappability with vision modeling and saliency analysis

Eldon Schoop, Xin Zhou, Gang Li, Zhourong Chen, Bjoern Hartmann, and Yang Li. Predicting and explaining mobile ui tappability with vision modeling and saliency analysis. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI ’22, pages 1–21, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450391573. URLh...

work page doi:10.1145/3491102 2022
[19]

Facilitating and automating usability testing of educational technologies.Computer Applications in Engineering Education, 32(3):e22725, 2024

Mikel Villamañe and Ainhoa Alvarez. Facilitating and automating usability testing of educational technologies.Computer Applications in Engineering Education, 32(3):e22725, 2024. URL https://doi.org/ 10.1002/cae.22725

work page doi:10.1002/cae.22725 2024
[20]

Serene: a web platform for the ux semi-automatic evaluation of website

Andrea Esposito, Giuseppe Desolda, Rosa Lanzilotti, and Maria Francesca Costabile. Serene: a web platform for the ux semi-automatic evaluation of website. InProceedings of the 2022 International Conference on Advanced Visual Interfaces, AVI ’22, pages 1–3, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450397193. URLhttps://doi.org...

work page doi:10.1145/3531073.3534464 2022
[21]

Automated usability tests for mobile devices through live emotions logging

Jackson Feijó Filho, Thiago Valle, and Wilson Prata. Automated usability tests for mobile devices through live emotions logging. InProceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, MobileHCI ’15, page 636–643, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 978145033...

work page doi:10.1145/2786567.2792902 2015
[22]

Mingming Fan, Ke Wu, Jian Zhao, Yue Li, Winter Wei, and Khai N. Truong. Vista: Integrating machine intelligence with visualization to support the investigation of think-aloud sessions.IEEE Transactions on Visualization and Computer Graphics, 26(1):343–352, 2020. URL https://doi.org/10.1109/TVCG.2019. 2934797

work page doi:10.1109/tvcg.2019 2020
[23]

Coux: Collaborative visual 20 analysis of think-aloud usability test videos for digital interfaces.IEEE Transactions on Visualization and Computer Graphics, 28(1):643–653, 2022

Ehsan Jahangirzadeh Soure, Emily Kuang, Mingming Fan, and Jian Zhao. Coux: Collaborative visual 20 analysis of think-aloud usability test videos for digital interfaces.IEEE Transactions on Visualization and Computer Graphics, 28(1):643–653, 2022. URLhttps://doi.org/10.1109/TVCG.2021.3114822

work page doi:10.1109/tvcg.2021.3114822 2022
[24]

uxsense: Supporting user experience analysis with visualization and computer vision.IEEE Transactions on Visualization and Computer Graphics, 30(7):3841–3856, 2024

Andrea Batch, Yipeng Ji, Mingming Fan, Jian Zhao, and Niklas Elmqvist. uxsense: Supporting user experience analysis with visualization and computer vision.IEEE Transactions on Visualization and Computer Graphics, 30(7):3841–3856, 2024. URLhttps://doi.org/10.1109/TVCG.2023.3241581

work page doi:10.1109/tvcg.2023.3241581 2024
[25]

Simuser: Generating usability feedback by simulating various users interacting with mobile applications

Wei Xiang, Hanfei Zhu, Suqi Lou, Xinli Chen, Zhenghua Pan, Yuping Jin, Shi Chen, and Lingyun Sun. Simuser: Generating usability feedback by simulating various users interacting with mobile applications. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, pages 1–17, New York, NY, USA, 2024. Association for Computing Ma...

work page doi:10.1145/3613904.3642481 2024
[26]

Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=

Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. Uxagent: An llm agent-based usability testing framework for web design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, pages 1–12, New York, NY, USA, 2025. Assoc...

work page doi:10.1145/3706599.3719729 2025
[27]

Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality- aware decisions

Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality- aware decisions. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, pages 1–13, New York, NY, USA, 2024. Associati...

work page doi:10.1145/3597503.3639180 2024
[28]

Contextual object detection with multimodal large language models.International Journal of Computer Vision, 133(2):825–843, Feb 2025

Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models.International Journal of Computer Vision, 133(2):825–843, Feb 2025. ISSN 1573-1405. URLhttps://doi.org/10.1007/s11263-024-02214-4

work page doi:10.1007/s11263-024-02214-4 2025
[29]

Guing: A mobile gui search engine using a vision-language model.ACM Trans

Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray, and Walid Maalej. Guing: A mobile gui search engine using a vision-language model.ACM Trans. Softw. Eng. Methodol., 34(4), April 2025. ISSN 1049-331X. URLhttps://doi.org/10.1145/3702993

work page doi:10.1145/3702993 2025
[30]

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok,...

work page doi:10.18653/v1/2024.acl-long.505 2024
[31]

URL https://www.nature.com/articles/ s41586-024-07930-y

Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, and José Hernández-Orallo. Larger and more instructable language models become less reliable.Nature, 634(8032): 61–68, 10 2024. URLhttps://doi.org/10.1038/s41586-024-07930-y

work page doi:10.1038/s41586-024-07930-y 2024
[32]

Yunting Liu, Shreya Bhandari, and Zachary A. Pardos. Leveraging llm respondents for item evaluation: A psychometric analysis.British Journal of Educational Technology, 56(3):1028–1052, 2025. URL https: //doi.org/10.1111/bjet.13570

work page doi:10.1111/bjet.13570 2025
[33]

Can ai serve as a substitute for human subjects in software engineering research?Automated Software Engineering, 31(1):13, 1 2024

Marco Gerosa, Bianca Trinkenreich, Igor Steinmacher, and Anita Sarma. Can ai serve as a substitute for human subjects in software engineering research?Automated Software Engineering, 31(1):13, 1 2024. ISSN 1573-7535. URLhttps://doi.org/10.1007/s10515-023-00409-6

work page doi:10.1007/s10515-023-00409-6 2024
[34]

PersonaLLM: Investigating the ability of large language models to express personality traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Findings of the Association for Computational Linguistics: NAACL 2024, pages 3605–3627, Mexico City, Mexico, June 2024. Associa...

work page doi:10.18653/v1/2024.findings-naacl.229 2024
[35]

Modeling human subjectivity in LLMs using explicit and implicit human factors in personas

Salvatore Giorgi, Tingting Liu, Ankit Aich, Kelsey Jane Isman, Garrick Sherman, Zachary Fried, João Sedoc, Lyle Ungar, and Brenda Curtis. Modeling human subjectivity in LLMs using explicit and implicit human factors in personas. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2...

work page 2024
[36]

In: Duh, K., Gomez, H., Bethard, S

Association for Computational Linguistics. URL https://doi.org/10.18653/v1/2024.findings- emnlp.420

work page doi:10.18653/v1/2024.findings- 2024
[37]

Using llms for market research.Harvard business school marketing unit working paper, (23-062):60, 2023

James Brand, Ayelet Israeli, and Donald Ngwe. Using llms for market research.Harvard business school marketing unit working paper, (23-062):60, 2023. URLhttp://dx.doi.org/10.2139/ssrn.4395751

work page doi:10.2139/ssrn.4395751 2023
[38]

Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo

Guilherme F.C.F. Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo. Exploring the psychology of llms’ moral and legal reasoning.Artificial Intelligence, 333:104145, 2024. ISSN 21 0004-3702. URLhttps://doi.org/10.1016/j.artint.2024.104145

work page doi:10.1016/j.artint.2024.104145 2024
[39]

Kevin Ma, Daniele Grandi, Christopher McComb, and Kosa Goucher-Lambert. Do large language models produce diverse design concepts? a comparative study with human-crowdsourced solutions.Journal of Computing and Information Science in Engineering, 25(2):024501, 2025. URL https://doi.org/10.1115/1. 4067332

work page doi:10.1115/1 2025
[40]

Cowie, and Joel Z

Aliya Amirova, Theodora Fteropoulli, Nafiso Ahmed, Martin R. Cowie, and Joel Z. Leibo. Framework- based qualitative analysis of free responses of large language models: Algorithmic fidelity.PLOS ONE, 19(3):1–33, 03 2024. doi: 10.1371/journal.pone.0300024. URLhttps://doi.org/10.1371/journal.pone. 0300024

work page doi:10.1371/journal.pone.0300024 2024
[41]

William Fleeson and Patrick Gallagher. The implications of big five standing for the distribution of trait manifestation in behavior: fifteen experience-sampling studies and a meta-analysis.J Pers Soc Psychol, 97 (6):1097–1114, December 2009

work page 2009
[42]

One size fits all? what counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology, 18(3):328–352, 2021

Virginia Braun and Victoria Clarke. One size fits all? what counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology, 18(3):328–352, 2021. URL https://doi.org/10. 1080/14780887.2020.1769238

work page arXiv 2021
[43]

Using Thematic Analysis in Psychology

Virginia Braun and Victoria Clarke. Using thematic analysis in psychology.Qualitative Research in Psychology, 3(2):77–101, 2006. URLhttps://doi.org/10.1191/1478088706qp063oa

work page doi:10.1191/1478088706qp063oa 2006
[44]

Content-based recommendation engine using term frequency-inverse document frequency vectorization and cosine similarity: A case study

Ida Lumintu. Content-based recommendation engine using term frequency-inverse document frequency vectorization and cosine similarity: A case study. In2023 IEEE 9th Information Technology International Seminar (ITIS), pages 1–6, Batu Malang, Indonesia, 2023. URL https://doi.org/10.1109/ITIS59651. 2023.10420137

work page doi:10.1109/itis59651 2023
[45]

A tale of two identities: An ethical audit of ai-crafted synthetic personas.Proceedings of the AAAI Conference on Artificial Intelligence, 40(44):37765–37774, Mar

Pranav Narayanan Venkit, Jiayi Li, Yingfan Zhou, Sarah Rajtmajer, and Shomir Wilson. A tale of two identities: An ethical audit of ai-crafted synthetic personas.Proceedings of the AAAI Conference on Artificial Intelligence, 40(44):37765–37774, Mar. 2026. URLhttps://doi.org/10.1609/aaai.v40i44.41112

work page doi:10.1609/aaai.v40i44.41112 2026
[46]

Typing or speaking? comparing text and voice answers to open questions on sensitive topics in smartphone surveys.Social Science Computer Review, 42(4):1066–1085, 2024

Jan Karem Höhne, Konstantin Gavras, and Joshua Claassen. Typing or speaking? comparing text and voice answers to open questions on sensitive topics in smartphone surveys.Social Science Computer Review, 42(4):1066–1085, 2024. URLhttps://doi.org/10.1177/08944393231160961

work page doi:10.1177/08944393231160961 2024
[47]

Arriaga, and Adam Tauman Kalai

Gati V Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings o...

work page 2023
[48]

John Wiley & Sons, Ltd,

Kent Bach.Pragmatics and the Philosophy of Language, chapter 21, pages 463–487. John Wiley & Sons, Ltd,

work page
[49]

URLhttps://doi.org/10.1002/9780470756959.ch21

ISBN 9780470756959. URLhttps://doi.org/10.1002/9780470756959.ch21

work page doi:10.1002/9780470756959.ch21
[50]

Pragmatic competence without embodiment? evaluating llm performance on implicature, presupposition, and speech acts.Journal of Cultural Cognitive Science, Apr 2026

Dilyorjon Solidjonov. Pragmatic competence without embodiment? evaluating llm performance on implicature, presupposition, and speech acts.Journal of Cultural Cognitive Science, Apr 2026. ISSN 2520-1018. URLhttps://doi.org/10.1007/s41809-026-00200-5

work page doi:10.1007/s41809-026-00200-5 2026
[51]

Waggoner, Ryan Jewell, and Nicholas J

Ryan Kennedy, Scott Clifford, Tyler Burleigh, Philip D. Waggoner, Ryan Jewell, and Nicholas J. G. Winter. The shape of and solutions to the mturk quality crisis.Political Science Research and Methods, 8(4):614–629,

work page
[52]

URLhttps://doi.org/10.1017/psrm.2020.6

work page doi:10.1017/psrm.2020.6 2020
[53]

Hager, Lorece V

June Wang, Gabriela Calderon, Erin R. Hager, Lorece V . Edwards, Andrea A. Berry, Yisi Liu, Janny Dinh, August C. Summers, Katherine A. Connor, Megan E. Collins, Laura Prichett, Beth R. Marshall, and Sara B. Johnson. Identifying and preventing fraudulent responses in online public health surveys: Lessons learned during the covid-19 pandemic.PLOS Global Pu...

work page doi:10.1371/journal.pgph.0001452 2023
[54]

In: Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems

Prakash Shukla, Phuong Bui, Sean S Levy, Max Kowalski, Ali Baigelenov, and Paul Parsons. De-skilling, cognitive offloading, and misplaced responsibilities: Potential ironies of ai-assisted design. InProceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, pages 1–7, New York, NY, USA, 2025. Association...

work page doi:10.1145/3706599.3719931 2025
[55]

Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, and Michael S

Joon Sung Park, Carolyn Q. Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, and Michael S. Bernstein. Llm agents grounded in self-reports enable general-purpose simulation of individuals, 2026. URL https://arxiv.org/abs/2411. 10109

work page 2026
[56]

Synthetic participants generated by large language 22 models: A systematic literature review, 2026

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Synthetic participants generated by large language 22 models: A systematic literature review, 2026. URLhttps://doi.org/10.21203/rs.3.rs-9057643/v1

work page doi:10.21203/rs.3.rs-9057643/v1 2026
[57]

Card sorting simulator: Augmenting design of logical information architectures with large language models, 2025

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Card sorting simulator: Augmenting design of logical information architectures with large language models, 2025. URL https://arxiv.org/abs/2505.09478

work page arXiv 2025
[58]

Systematic literature review of automation and artificial intelligence in usability issue detection, 2025

Eduard Kuric, Peter Demcak, Matus Krajcovic, and Jan Lang. Systematic literature review of automation and artificial intelligence in usability issue detection, 2025. URLhttps://arxiv.org/abs/2504.01415

work page arXiv 2025
[59]

Epistemic limits of hallucination mitigation in large language models.Foundations of Science, Mar 2026

Yu Zhang. Epistemic limits of hallucination mitigation in large language models.Foundations of Science, Mar 2026. ISSN 1572-8471. URLhttps://doi.org/10.1007/s10699-026-10030-x

work page doi:10.1007/s10699-026-10030-x 2026
[60]

Talking surveys: How photorealistic embodied conversational agents shape response quality, engagement, and satisfaction, 2025

Matus Krajcovic, Peter Demcak, and Eduard Kuric. Talking surveys: How photorealistic embodied conversational agents shape response quality, engagement, and satisfaction, 2025. URL https://arxiv. org/abs/2508.02376

work page arXiv 2025
[61]

Democratizing eye-tracking? appearance- based gaze estimation with improved attention branch.Engineering Applications of Artificial Intelligence, 149:110494, 2025

Eduard Kuric, Peter Demcak, Jozef Majzel, and Giang Nguyen. Democratizing eye-tracking? appearance- based gaze estimation with improved attention branch.Engineering Applications of Artificial Intelligence, 149:110494, 2025. ISSN 0952-1976. URLhttps://doi.org/10.1016/j.engappai.2025.110494

work page doi:10.1016/j.engappai.2025.110494 2025
[62]

Card sorting with fewer cards and the same mental models? a reexamination of an established practice.International Journal of Human–Computer Interaction, 0 (0):1–25, 2026

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Card sorting with fewer cards and the same mental models? a reexamination of an established practice.International Journal of Human–Computer Interaction, 0 (0):1–25, 2026. URLhttps://doi.org/10.1080/10447318.2025.2603633

work page doi:10.1080/10447318.2025.2603633 2026
[63]

Validation of information architecture: Cross- methodological comparison of tree testing variants and prototype user testing.Information and Software Technology, 183:107740, 2025

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Validation of information architecture: Cross- methodological comparison of tree testing variants and prototype user testing.Information and Software Technology, 183:107740, 2025. ISSN 0950-5849. URLhttps://doi.org/10.1016/j.infsof.2025.107740. A Extended statistical results For extended statistical results...

work page doi:10.1016/j.infsof.2025.107740 2025

[1] [1]

Critical artificial intelligence literacy for psychologists, Oct 2025

Olivia Guest and Iris van Rooij. Critical artificial intelligence literacy for psychologists, Oct 2025. URL https://doi.org/10.31234/osf.io/dkrgj_v1

work page doi:10.31234/osf.io/dkrgj_v1 2025

[2] [2]

We reject the use of generative artificial intelligence for reflexive qualitative research.Qualitative Inquiry, 0(0): 10778004251401851, 2025

Tanisha Jowsey, Virginia Braun, Victoria Clarke, Deborah Lupton, and Michelle Fine. We reject the use of generative artificial intelligence for reflexive qualitative research.Qualitative Inquiry, 0(0): 10778004251401851, 2025. URLhttps://doi.org/10.1177/10778004251401851

work page doi:10.1177/10778004251401851 2025

[3] [3]

Sanders, Alex Ulinich, and Bruce Schneier

Nathan E. Sanders, Alex Ulinich, and Bruce Schneier. Demonstrations of the Potential of AI-based Political Issue Polling.Harvard Data Science Review, 5(4):23, oct 27 2023. URL https://doi.org/10.1162/ 99608f92.1d3cf75d

work page 2023

[4] [4]

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025. URLhttps://arxiv.org/abs/2506.06941

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Take caution in using llms as human surrogates.Proceedings of the National Academy of Sciences, 122(24):e2501660122, 2025

Yuan Gao, Dokyun Lee, Gordon Burtch, and Sina Fazelpour. Take caution in using llms as human surrogates.Proceedings of the National Academy of Sciences, 122(24):e2501660122, 2025. URL https: //doi.org/10.1073/pnas.2501660122. 19

work page doi:10.1073/pnas.2501660122 2025

[6] [6]

Adler, and Jun Hwa Cheah

Monika Imschloss, Marko Sarstedt, Susanne J. Adler, and Jun Hwa Cheah. Using llms in sensory service research: initial insights and perspectives.The Service Industries Journal, 0(0):1–22, 2025. URL https://doi.org/10.1080/02642069.2025.2479723

work page doi:10.1080/02642069.2025.2479723 2025

[7] [7]

Using cognitive psychology to understand gpt-3.Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023

Marcel Binz and Eric Schulz. Using cognitive psychology to understand gpt-3.Proceedings of the National Academy of Sciences, 120(6):e2218523120, 2023. URLhttps://doi.org/10.1073/pnas.2218523120

work page doi:10.1073/pnas.2218523120 2023

[8] [8]

Stick to your role! stability of personal values expressed in large language models.PLOS ONE, 19(8):1–20, 08

Grgur Kovaˇ c, Rémy Portelas, Masataka Sawayama, Peter Ford Dominey, and Pierre-Yves Oudeyer. Stick to your role! stability of personal values expressed in large language models.PLOS ONE, 19(8):1–20, 08

work page

[9] [9]

URLhttps://doi.org/10.1371/journal.pone.0309114

work page doi:10.1371/journal.pone.0309114

[10] [10]

Craig Tomlin.UX and Usability Testing Data, chapter 7, pages 97–127

W. Craig Tomlin.UX and Usability Testing Data, chapter 7, pages 97–127. Apress, Berkeley, CA, 2018. ISBN 978-1-4842-3867-7. URLhttps://doi.org/10.1007/978-1-4842-3867-7_7

work page doi:10.1007/978-1-4842-3867-7_7 2018

[11] [11]

Is usability testing valid with prototypes where clickable hotspots are highlighted upon misclick?Journal of Systems and Software, 226:112446, 2025

Matus Krajcovic, Peter Demcak, and Eduard Kuric. Is usability testing valid with prototypes where clickable hotspots are highlighted upon misclick?Journal of Systems and Software, 226:112446, 2025. ISSN 0164-1212. URLhttps://doi.org/10.1016/j.jss.2025.112446

work page doi:10.1016/j.jss.2025.112446 2025

[12] [12]

Dirty clicks: A study of the usability and security implications of click-related behaviors on the web

Iskander Sanchez-Rola, Davide Balzarotti, Christopher Kruegel, Giovanni Vigna, and Igor Santos. Dirty clicks: A study of the usability and security implications of click-related behaviors on the web. In Proceedings of The Web Conference 2020, WWW ’20, page 395–406, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450370233. URLhttps:...

work page doi:10.1145/3366423.3380124 2020

[13] [13]

A study of first click behaviour and user interaction on the google serp

Chris Barry and Mark Lardner. A study of first click behaviour and user interaction on the google serp. In Jaroslav Pokorny, Vaclav Repa, Karel Richta, Wita Wojtkowski, Henry Linger, Chris Barry, and Michael Lang, editors,Information Systems Development, pages 89–99, New York, NY, 2011. Springer New York. ISBN 978-1-4419-9790-6

work page 2011

[14] [14]

Web page graphic design usability testing en- hanced with eye-tracking

Piotr Chynał, Julia Falkowska, and Janusz Sobecki. Web page graphic design usability testing en- hanced with eye-tracking. In Waldemar Karwowski and Tareq Ahram, editors,Intelligent Human Systems Integration, pages 515–520, Cham, 2018. Springer International Publishing

work page 2018

[15] [15]

Evaluating and analyzing click simulation in web search

Stepan Malkevich, Ilya Markov, Elena Michailova, and Maarten de Rijke. Evaluating and analyzing click simulation in web search. InProceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ’17, page 281–284, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450344906. URLhttps://doi.org/10.1145/312...

work page doi:10.1145/3121050.3121096 2017

[16] [16]

A simulation model of intermittently controlled point- and-click behaviour

Seungwon Do, Minsuk Chang, and Byungjoo Lee. A simulation model of intermittently controlled point- and-click behaviour. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, pages 1–17, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380966. URLhttps://doi.org/10.1145/3411764.3445514

work page doi:10.1145/3411764.3445514 2021

[17] [17]

Springer Nature, 2022

Aleksandr Chuklin, Ilya Markov, and Maarten De Rijke.Click models for web search. Springer Nature, 2022. ISBN 9783031022944

work page 2022

[18] [18]

Predicting and explaining mobile ui tappability with vision modeling and saliency analysis

Eldon Schoop, Xin Zhou, Gang Li, Zhourong Chen, Bjoern Hartmann, and Yang Li. Predicting and explaining mobile ui tappability with vision modeling and saliency analysis. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI ’22, pages 1–21, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450391573. URLh...

work page doi:10.1145/3491102 2022

[19] [19]

Facilitating and automating usability testing of educational technologies.Computer Applications in Engineering Education, 32(3):e22725, 2024

Mikel Villamañe and Ainhoa Alvarez. Facilitating and automating usability testing of educational technologies.Computer Applications in Engineering Education, 32(3):e22725, 2024. URL https://doi.org/ 10.1002/cae.22725

work page doi:10.1002/cae.22725 2024

[20] [20]

Serene: a web platform for the ux semi-automatic evaluation of website

Andrea Esposito, Giuseppe Desolda, Rosa Lanzilotti, and Maria Francesca Costabile. Serene: a web platform for the ux semi-automatic evaluation of website. InProceedings of the 2022 International Conference on Advanced Visual Interfaces, AVI ’22, pages 1–3, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450397193. URLhttps://doi.org...

work page doi:10.1145/3531073.3534464 2022

[21] [21]

Automated usability tests for mobile devices through live emotions logging

Jackson Feijó Filho, Thiago Valle, and Wilson Prata. Automated usability tests for mobile devices through live emotions logging. InProceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, MobileHCI ’15, page 636–643, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 978145033...

work page doi:10.1145/2786567.2792902 2015

[22] [22]

Mingming Fan, Ke Wu, Jian Zhao, Yue Li, Winter Wei, and Khai N. Truong. Vista: Integrating machine intelligence with visualization to support the investigation of think-aloud sessions.IEEE Transactions on Visualization and Computer Graphics, 26(1):343–352, 2020. URL https://doi.org/10.1109/TVCG.2019. 2934797

work page doi:10.1109/tvcg.2019 2020

[23] [23]

Coux: Collaborative visual 20 analysis of think-aloud usability test videos for digital interfaces.IEEE Transactions on Visualization and Computer Graphics, 28(1):643–653, 2022

Ehsan Jahangirzadeh Soure, Emily Kuang, Mingming Fan, and Jian Zhao. Coux: Collaborative visual 20 analysis of think-aloud usability test videos for digital interfaces.IEEE Transactions on Visualization and Computer Graphics, 28(1):643–653, 2022. URLhttps://doi.org/10.1109/TVCG.2021.3114822

work page doi:10.1109/tvcg.2021.3114822 2022

[24] [24]

uxsense: Supporting user experience analysis with visualization and computer vision.IEEE Transactions on Visualization and Computer Graphics, 30(7):3841–3856, 2024

Andrea Batch, Yipeng Ji, Mingming Fan, Jian Zhao, and Niklas Elmqvist. uxsense: Supporting user experience analysis with visualization and computer vision.IEEE Transactions on Visualization and Computer Graphics, 30(7):3841–3856, 2024. URLhttps://doi.org/10.1109/TVCG.2023.3241581

work page doi:10.1109/tvcg.2023.3241581 2024

[25] [25]

Simuser: Generating usability feedback by simulating various users interacting with mobile applications

Wei Xiang, Hanfei Zhu, Suqi Lou, Xinli Chen, Zhenghua Pan, Yuping Jin, Shi Chen, and Lingyun Sun. Simuser: Generating usability feedback by simulating various users interacting with mobile applications. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, pages 1–17, New York, NY, USA, 2024. Association for Computing Ma...

work page doi:10.1145/3613904.3642481 2024

[26] [26]

Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=

Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. Uxagent: An llm agent-based usability testing framework for web design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, pages 1–12, New York, NY, USA, 2025. Assoc...

work page doi:10.1145/3706599.3719729 2025

[27] [27]

Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality- aware decisions

Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality- aware decisions. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, pages 1–13, New York, NY, USA, 2024. Associati...

work page doi:10.1145/3597503.3639180 2024

[28] [28]

Contextual object detection with multimodal large language models.International Journal of Computer Vision, 133(2):825–843, Feb 2025

Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models.International Journal of Computer Vision, 133(2):825–843, Feb 2025. ISSN 1573-1405. URLhttps://doi.org/10.1007/s11263-024-02214-4

work page doi:10.1007/s11263-024-02214-4 2025

[29] [29]

Guing: A mobile gui search engine using a vision-language model.ACM Trans

Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray, and Walid Maalej. Guing: A mobile gui search engine using a vision-language model.ACM Trans. Softw. Eng. Methodol., 34(4), April 2025. ISSN 1049-331X. URLhttps://doi.org/10.1145/3702993

work page doi:10.1145/3702993 2025

[30] [30]

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, Bangkok,...

work page doi:10.18653/v1/2024.acl-long.505 2024

[31] [31]

URL https://www.nature.com/articles/ s41586-024-07930-y

Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri, and José Hernández-Orallo. Larger and more instructable language models become less reliable.Nature, 634(8032): 61–68, 10 2024. URLhttps://doi.org/10.1038/s41586-024-07930-y

work page doi:10.1038/s41586-024-07930-y 2024

[32] [32]

Yunting Liu, Shreya Bhandari, and Zachary A. Pardos. Leveraging llm respondents for item evaluation: A psychometric analysis.British Journal of Educational Technology, 56(3):1028–1052, 2025. URL https: //doi.org/10.1111/bjet.13570

work page doi:10.1111/bjet.13570 2025

[33] [33]

Can ai serve as a substitute for human subjects in software engineering research?Automated Software Engineering, 31(1):13, 1 2024

Marco Gerosa, Bianca Trinkenreich, Igor Steinmacher, and Anita Sarma. Can ai serve as a substitute for human subjects in software engineering research?Automated Software Engineering, 31(1):13, 1 2024. ISSN 1573-7535. URLhttps://doi.org/10.1007/s10515-023-00409-6

work page doi:10.1007/s10515-023-00409-6 2024

[34] [34]

PersonaLLM: Investigating the ability of large language models to express personality traits

Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Findings of the Association for Computational Linguistics: NAACL 2024, pages 3605–3627, Mexico City, Mexico, June 2024. Associa...

work page doi:10.18653/v1/2024.findings-naacl.229 2024

[35] [35]

Modeling human subjectivity in LLMs using explicit and implicit human factors in personas

Salvatore Giorgi, Tingting Liu, Ankit Aich, Kelsey Jane Isman, Garrick Sherman, Zachary Fried, João Sedoc, Lyle Ungar, and Brenda Curtis. Modeling human subjectivity in LLMs using explicit and implicit human factors in personas. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2...

work page 2024

[36] [36]

In: Duh, K., Gomez, H., Bethard, S

Association for Computational Linguistics. URL https://doi.org/10.18653/v1/2024.findings- emnlp.420

work page doi:10.18653/v1/2024.findings- 2024

[37] [37]

Using llms for market research.Harvard business school marketing unit working paper, (23-062):60, 2023

James Brand, Ayelet Israeli, and Donald Ngwe. Using llms for market research.Harvard business school marketing unit working paper, (23-062):60, 2023. URLhttp://dx.doi.org/10.2139/ssrn.4395751

work page doi:10.2139/ssrn.4395751 2023

[38] [38]

Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo

Guilherme F.C.F. Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, and Marcelo de Araújo. Exploring the psychology of llms’ moral and legal reasoning.Artificial Intelligence, 333:104145, 2024. ISSN 21 0004-3702. URLhttps://doi.org/10.1016/j.artint.2024.104145

work page doi:10.1016/j.artint.2024.104145 2024

[39] [39]

Kevin Ma, Daniele Grandi, Christopher McComb, and Kosa Goucher-Lambert. Do large language models produce diverse design concepts? a comparative study with human-crowdsourced solutions.Journal of Computing and Information Science in Engineering, 25(2):024501, 2025. URL https://doi.org/10.1115/1. 4067332

work page doi:10.1115/1 2025

[40] [40]

Cowie, and Joel Z

Aliya Amirova, Theodora Fteropoulli, Nafiso Ahmed, Martin R. Cowie, and Joel Z. Leibo. Framework- based qualitative analysis of free responses of large language models: Algorithmic fidelity.PLOS ONE, 19(3):1–33, 03 2024. doi: 10.1371/journal.pone.0300024. URLhttps://doi.org/10.1371/journal.pone. 0300024

work page doi:10.1371/journal.pone.0300024 2024

[41] [41]

William Fleeson and Patrick Gallagher. The implications of big five standing for the distribution of trait manifestation in behavior: fifteen experience-sampling studies and a meta-analysis.J Pers Soc Psychol, 97 (6):1097–1114, December 2009

work page 2009

[42] [42]

One size fits all? what counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology, 18(3):328–352, 2021

Virginia Braun and Victoria Clarke. One size fits all? what counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology, 18(3):328–352, 2021. URL https://doi.org/10. 1080/14780887.2020.1769238

work page arXiv 2021

[43] [43]

Using Thematic Analysis in Psychology

Virginia Braun and Victoria Clarke. Using thematic analysis in psychology.Qualitative Research in Psychology, 3(2):77–101, 2006. URLhttps://doi.org/10.1191/1478088706qp063oa

work page doi:10.1191/1478088706qp063oa 2006

[44] [44]

Content-based recommendation engine using term frequency-inverse document frequency vectorization and cosine similarity: A case study

Ida Lumintu. Content-based recommendation engine using term frequency-inverse document frequency vectorization and cosine similarity: A case study. In2023 IEEE 9th Information Technology International Seminar (ITIS), pages 1–6, Batu Malang, Indonesia, 2023. URL https://doi.org/10.1109/ITIS59651. 2023.10420137

work page doi:10.1109/itis59651 2023

[45] [45]

A tale of two identities: An ethical audit of ai-crafted synthetic personas.Proceedings of the AAAI Conference on Artificial Intelligence, 40(44):37765–37774, Mar

Pranav Narayanan Venkit, Jiayi Li, Yingfan Zhou, Sarah Rajtmajer, and Shomir Wilson. A tale of two identities: An ethical audit of ai-crafted synthetic personas.Proceedings of the AAAI Conference on Artificial Intelligence, 40(44):37765–37774, Mar. 2026. URLhttps://doi.org/10.1609/aaai.v40i44.41112

work page doi:10.1609/aaai.v40i44.41112 2026

[46] [46]

Typing or speaking? comparing text and voice answers to open questions on sensitive topics in smartphone surveys.Social Science Computer Review, 42(4):1066–1085, 2024

Jan Karem Höhne, Konstantin Gavras, and Joshua Claassen. Typing or speaking? comparing text and voice answers to open questions on sensitive topics in smartphone surveys.Social Science Computer Review, 42(4):1066–1085, 2024. URLhttps://doi.org/10.1177/08944393231160961

work page doi:10.1177/08944393231160961 2024

[47] [47]

Arriaga, and Adam Tauman Kalai

Gati V Aher, Rosa I. Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans and replicate human subject studies. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings o...

work page 2023

[48] [48]

John Wiley & Sons, Ltd,

Kent Bach.Pragmatics and the Philosophy of Language, chapter 21, pages 463–487. John Wiley & Sons, Ltd,

work page

[49] [49]

URLhttps://doi.org/10.1002/9780470756959.ch21

ISBN 9780470756959. URLhttps://doi.org/10.1002/9780470756959.ch21

work page doi:10.1002/9780470756959.ch21

[50] [50]

Pragmatic competence without embodiment? evaluating llm performance on implicature, presupposition, and speech acts.Journal of Cultural Cognitive Science, Apr 2026

Dilyorjon Solidjonov. Pragmatic competence without embodiment? evaluating llm performance on implicature, presupposition, and speech acts.Journal of Cultural Cognitive Science, Apr 2026. ISSN 2520-1018. URLhttps://doi.org/10.1007/s41809-026-00200-5

work page doi:10.1007/s41809-026-00200-5 2026

[51] [51]

Waggoner, Ryan Jewell, and Nicholas J

Ryan Kennedy, Scott Clifford, Tyler Burleigh, Philip D. Waggoner, Ryan Jewell, and Nicholas J. G. Winter. The shape of and solutions to the mturk quality crisis.Political Science Research and Methods, 8(4):614–629,

work page

[52] [52]

URLhttps://doi.org/10.1017/psrm.2020.6

work page doi:10.1017/psrm.2020.6 2020

[53] [53]

Hager, Lorece V

June Wang, Gabriela Calderon, Erin R. Hager, Lorece V . Edwards, Andrea A. Berry, Yisi Liu, Janny Dinh, August C. Summers, Katherine A. Connor, Megan E. Collins, Laura Prichett, Beth R. Marshall, and Sara B. Johnson. Identifying and preventing fraudulent responses in online public health surveys: Lessons learned during the covid-19 pandemic.PLOS Global Pu...

work page doi:10.1371/journal.pgph.0001452 2023

[54] [54]

In: Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems

Prakash Shukla, Phuong Bui, Sean S Levy, Max Kowalski, Ali Baigelenov, and Paul Parsons. De-skilling, cognitive offloading, and misplaced responsibilities: Potential ironies of ai-assisted design. InProceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA ’25, pages 1–7, New York, NY, USA, 2025. Association...

work page doi:10.1145/3706599.3719931 2025

[55] [55]

Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, and Michael S

Joon Sung Park, Carolyn Q. Zou, Jonne Kamphorst, Niles Egan, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Percy Liang, Robb Willer, and Michael S. Bernstein. Llm agents grounded in self-reports enable general-purpose simulation of individuals, 2026. URL https://arxiv.org/abs/2411. 10109

work page 2026

[56] [56]

Synthetic participants generated by large language 22 models: A systematic literature review, 2026

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Synthetic participants generated by large language 22 models: A systematic literature review, 2026. URLhttps://doi.org/10.21203/rs.3.rs-9057643/v1

work page doi:10.21203/rs.3.rs-9057643/v1 2026

[57] [57]

Card sorting simulator: Augmenting design of logical information architectures with large language models, 2025

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Card sorting simulator: Augmenting design of logical information architectures with large language models, 2025. URL https://arxiv.org/abs/2505.09478

work page arXiv 2025

[58] [58]

Systematic literature review of automation and artificial intelligence in usability issue detection, 2025

Eduard Kuric, Peter Demcak, Matus Krajcovic, and Jan Lang. Systematic literature review of automation and artificial intelligence in usability issue detection, 2025. URLhttps://arxiv.org/abs/2504.01415

work page arXiv 2025

[59] [59]

Epistemic limits of hallucination mitigation in large language models.Foundations of Science, Mar 2026

Yu Zhang. Epistemic limits of hallucination mitigation in large language models.Foundations of Science, Mar 2026. ISSN 1572-8471. URLhttps://doi.org/10.1007/s10699-026-10030-x

work page doi:10.1007/s10699-026-10030-x 2026

[60] [60]

Talking surveys: How photorealistic embodied conversational agents shape response quality, engagement, and satisfaction, 2025

Matus Krajcovic, Peter Demcak, and Eduard Kuric. Talking surveys: How photorealistic embodied conversational agents shape response quality, engagement, and satisfaction, 2025. URL https://arxiv. org/abs/2508.02376

work page arXiv 2025

[61] [61]

Democratizing eye-tracking? appearance- based gaze estimation with improved attention branch.Engineering Applications of Artificial Intelligence, 149:110494, 2025

Eduard Kuric, Peter Demcak, Jozef Majzel, and Giang Nguyen. Democratizing eye-tracking? appearance- based gaze estimation with improved attention branch.Engineering Applications of Artificial Intelligence, 149:110494, 2025. ISSN 0952-1976. URLhttps://doi.org/10.1016/j.engappai.2025.110494

work page doi:10.1016/j.engappai.2025.110494 2025

[62] [62]

Card sorting with fewer cards and the same mental models? a reexamination of an established practice.International Journal of Human–Computer Interaction, 0 (0):1–25, 2026

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Card sorting with fewer cards and the same mental models? a reexamination of an established practice.International Journal of Human–Computer Interaction, 0 (0):1–25, 2026. URLhttps://doi.org/10.1080/10447318.2025.2603633

work page doi:10.1080/10447318.2025.2603633 2026

[63] [63]

Validation of information architecture: Cross- methodological comparison of tree testing variants and prototype user testing.Information and Software Technology, 183:107740, 2025

Eduard Kuric, Peter Demcak, and Matus Krajcovic. Validation of information architecture: Cross- methodological comparison of tree testing variants and prototype user testing.Information and Software Technology, 183:107740, 2025. ISSN 0950-5849. URLhttps://doi.org/10.1016/j.infsof.2025.107740. A Extended statistical results For extended statistical results...

work page doi:10.1016/j.infsof.2025.107740 2025