What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience
Pith reviewed 2026-05-20 08:47 UTC · model grok-4.3
The pith
GPT fails to match human clicking patterns in more than half of real UX first-click tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Within twelve diverse first-click tests obtained from real UX practice, GPT demonstrates critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes, with significantly different distributions from real data in 53% of tasks. Participant personas, chain-of-thought reasoning, and different sampling parameters fail to create sensible fidelity improvements apart from inflating believability. The observed distortions in synthetic responses reduce their overall analytical usefulness as a decision-making resource compared with real data, which can be linked to the statistical nature of LLMs and their encoding of semantic heuristics from training on linguistic data.
What carries the argument
Comparison of GPT-generated first-click predictions and explanations against human participant data from twelve real-world UX tests, using statistical distribution differences to measure misalignment.
If this is right
- Synthetic data from GPT introduces multiple nuanced distortions that compromise the validity of UX research findings.
- Prompting techniques such as personas or chain-of-thought reasoning do not produce meaningful improvements in behavioral fidelity.
- The statistical and heuristic-driven properties of LLMs inherently restrict their capacity to simulate actual user interactions.
- Real human participant data remains necessary for trustworthy decision-making in user experience design.
Where Pith is reading between the lines
- AI simulation of human behavior in research may need training on actual interaction logs rather than text corpora alone to reach usable fidelity.
- The same limitations could affect LLM use in any domain that models human decision processes, such as market research or policy testing.
- Teams adopting synthetic UX testing should add targeted real-user checks to catch distortions before they influence product choices.
Load-bearing premise
The twelve first-click tests drawn from real UX practice are sufficiently diverse and representative to support general claims about GPT's misalignment with human behavior across user experience research.
What would settle it
Repeating the study on a new collection of first-click tests or other UX methods and finding that GPT matches real-user click distributions and reasoning in the majority of cases would falsify the central claim.
Original abstract
Synthetic participants represent a methodologically concerning concept that threatens the integrity of UX research. Findings from previous experiments specify how AI outputs are misaligned with the behaviors and thoughts of real humans in various ways. However, industry voices keep underestimating their severity, advocating for practical compromises where good-enough data does not need to be perfect, and all issues will be solved by future tuning. Our study tackles the lack of systematic understanding of the practical issues that arise with synthetic behavior and its use for steering decisions within real contexts. Within twelve diverse first click tests (n = 3431) obtained from real UX practice, we examine the ability of GPT to predict where humans click and how they reason about their behavior. Results (e.g., significantly different distribution from real data in 53% of tasks) demonstrate critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes. Participant personas, chain-of-thought reasoning in GPT, and different sampling parameters fail to create sensible fidelity improvements apart from inflating believability. We expose a multitude of nuanced distortions in synthetic responses that reduce their overall analytical usefulness as a decision-making resource, compared with real data. Observed distortions can be theoretically linked to the properties categorically inherent to LLMs: their statistical nature and encoding of semantic heuristics dependent on their training on linguistic data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that GPT exhibits critical misalignment with human click behavior and reasoning in first-click UX tests. Across 12 tasks drawn from real practice (total n=3431 real users), GPT produces significantly different click distributions in 53% of cases; common mitigations such as participant personas and chain-of-thought prompting yield only marginal or illusory improvements in fidelity. The authors attribute these distortions to inherent LLM properties (statistical next-token prediction and training on linguistic data) and conclude that synthetic participants therefore threaten the integrity of UX research and decision-making.
Significance. If the empirical comparison holds, the work is significant for HCI because it supplies a sizable, multi-task, real-user baseline that quantifies practical distortions in click patterns and cognitive rationales. The large real-user sample (n=3431) and use of tasks obtained from actual UX practice are clear strengths that move the discussion beyond small-scale or purely synthetic evaluations. The study also supplies falsifiable, task-level predictions that can be checked against future human data.
major comments (2)
- [Methods / Study Design] The selection and diversity of the twelve first-click tests are described only as 'obtained from real UX practice' and 'diverse' with no enumeration of domains, interface types, user goals, sampling frame, or quantitative diversity metrics. This assumption is load-bearing for the headline claim that misalignment occurs in 53% of tasks and that synthetic data therefore has 'critical failures' for UX decision-making; without documented representativeness, the observed rate could be task-specific rather than diagnostic of LLM properties.
- [Methods / Experimental Procedure] Exact prompting templates, temperature/sampling parameters, statistical thresholds for declaring 'significantly different distribution,' and participant exclusion criteria are not reported in sufficient detail. These omissions directly affect verifiability of the 53% figure and the conclusion that personas and chain-of-thought produce no sensible fidelity gains.
minor comments (2)
- [Results] A table or figure that lists per-task click-distribution statistics (e.g., chi-square values, p-values, effect sizes) alongside the real vs. synthetic comparison would improve readability of the central result.
- [Abstract / Introduction] The abstract and introduction could more explicitly name the GPT model version and release date used, as these details matter for reproducibility in a rapidly changing field.
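The per-task table requested in the first minor comment could be assembled along the following lines. This is a minimal sketch, not the authors' analysis: the task names and click counts are invented for illustration, the function names `cohens_w` and `format_task_table` are hypothetical, and Cohen's w is only one possible effect size (the paper may report a different one). It assumes every click region received at least one human click.

```python
import math

# Invented example data: task name -> (synthetic click counts, human click counts),
# one count per click region, in the same order for both lists.
EXAMPLE_TASKS = {
    "task_a": ([50, 30, 20], [500, 300, 200]),
    "task_b": ([90, 5, 5], [500, 300, 200]),
}


def cohens_w(synthetic_counts, human_counts):
    """Cohen's w between synthetic proportions and human reference proportions."""
    n_syn = sum(synthetic_counts)
    n_hum = sum(human_counts)
    w_squared = 0.0
    for syn, hum in zip(synthetic_counts, human_counts):
        p_syn = syn / n_syn
        p_hum = hum / n_hum  # assumed to be non-zero for every region
        w_squared += (p_syn - p_hum) ** 2 / p_hum
    return math.sqrt(w_squared)


def format_task_table(tasks):
    """Render one plain-text row per task with sample sizes and effect size."""
    lines = [f"{'task':<10}{'n_human':>9}{'n_synth':>9}{'w':>7}"]
    for name, (syn, hum) in tasks.items():
        w = cohens_w(syn, hum)
        lines.append(f"{name:<10}{sum(hum):>9}{sum(syn):>9}{w:>7.2f}")
    return "\n".join(lines)


print(format_task_table(EXAMPLE_TASKS))
```

A table of this shape shows how far the synthetic distribution departs from the human one on each task, which a significance flag alone does not convey.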
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. Their comments on methodological transparency are well-taken and have prompted us to strengthen the paper. We respond to each major comment below and indicate the changes we will incorporate in the revised version.
- Referee: [Methods / Study Design] The selection and diversity of the twelve first-click tests are described only as 'obtained from real UX practice' and 'diverse' with no enumeration of domains, interface types, user goals, sampling frame, or quantitative diversity metrics. This assumption is load-bearing for the headline claim that misalignment occurs in 53% of tasks and that synthetic data therefore has 'critical failures' for UX decision-making; without documented representativeness, the observed rate could be task-specific rather than diagnostic of LLM properties.
  Authors: We appreciate the referee's emphasis on documenting task selection to support generalizability claims. The twelve tasks were drawn directly from professional UX testing archives to capture realistic first-click scenarios across common interface contexts. To address the concern, the revised manuscript will include an expanded Methods subsection with a summary table listing each task's domain (e.g., e-commerce checkout, news article navigation, SaaS dashboard), interface type (desktop web or mobile), primary user goal, and approximate complexity indicators such as number of visible elements. This addition will allow readers to evaluate representativeness directly and will clarify that the 53% rate reflects patterns observed across a range of practical tasks rather than a narrow subset. We maintain that the core finding of misalignment remains diagnostic of LLM properties, but the added documentation will reduce any ambiguity about task specificity.
  Revision: yes
- Referee: [Methods / Experimental Procedure] Exact prompting templates, temperature/sampling parameters, statistical thresholds for declaring 'significantly different distribution,' and participant exclusion criteria are not reported in sufficient detail. These omissions directly affect verifiability of the 53% figure and the conclusion that personas and chain-of-thought produce no sensible fidelity gains.
  Authors: We agree that precise procedural details are necessary for reproducibility and for readers to assess the robustness of the 53% figure and the prompting results. The original submission provided a high-level description; the revised manuscript will expand the Experimental Procedure section to include the verbatim base prompt template, the exact persona and chain-of-thought variants, the temperature and sampling settings used (temperature = 1.0, top_p = 1.0 for the primary runs), the statistical test and threshold applied (chi-square goodness-of-fit tests with p < 0.05, Bonferroni-adjusted for the 12 tasks), and the real-user exclusion criteria (incomplete sessions, failed attention checks, or response times below a minimum threshold). These details will be presented in the main text with references to supplementary materials containing the full prompt strings. This revision will directly support verification of both the misalignment rates and the limited efficacy of the tested mitigations.
  Revision: yes
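The test named in the simulated rebuttal (chi-square goodness-of-fit at p < 0.05, Bonferroni-adjusted across tasks) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the counts are invented, the helper names `chi2_sf`, `goodness_of_fit`, and `misaligned_tasks` are hypothetical, human click proportions are treated as the expected distribution, and every click region is assumed to have at least one human click. The p-value uses only the standard library, via the recurrence for the regularized upper incomplete gamma function at integer and half-integer arguments.

```python
import math

# Invented example data: task name -> (synthetic click counts, human click counts).
EXAMPLE_TASKS = {
    "task_a": ([50, 30, 20], [500, 300, 200]),
    "task_b": ([90, 5, 5], [500, 300, 200]),
}


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)  # Q(1, x) = exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))  # Q(1/2, x) = erfc(sqrt(x))
    # Recurrence: Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def goodness_of_fit(synthetic_counts, human_counts):
    """Chi-square statistic and p-value of synthetic clicks against human proportions."""
    n_syn = sum(synthetic_counts)
    n_hum = sum(human_counts)
    stat = 0.0
    for syn, hum in zip(synthetic_counts, human_counts):
        expected = n_syn * hum / n_hum  # assumes hum > 0 for every region
        stat += (syn - expected) ** 2 / expected
    return stat, chi2_sf(stat, len(human_counts) - 1)


def misaligned_tasks(tasks, alpha=0.05):
    """Names of tasks whose p-value falls below the Bonferroni-adjusted threshold."""
    threshold = alpha / len(tasks)  # Bonferroni: divide alpha by the number of tests
    return [
        name
        for name, (syn, hum) in tasks.items()
        if goodness_of_fit(syn, hum)[1] < threshold
    ]


print(misaligned_tasks(EXAMPLE_TASKS))
```

With twelve tasks, as in the paper, the adjusted per-task threshold would be 0.05 / 12, so a task counts as misaligned only when its p-value falls below roughly 0.0042.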
Circularity Check
No significant circularity; empirical comparison to external human data.
The paper's central claims rest on direct empirical comparisons between GPT outputs and real human click distributions collected from twelve first-click tests obtained from UX practice. No load-bearing steps reduce by construction to fitted parameters, self-definitions, or prior author citations; the 53% misalignment rate and related distortions are computed against independent external benchmarks rather than derived from the paper's own inputs. The analysis of personas, chain-of-thought, and sampling parameters is likewise evaluated against the same external human data, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Standard math: Statistical significance tests can reliably identify differences in click distributions between human and AI responses.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Results (e.g., significantly different distribution from real data in 53% of tasks) demonstrate critical failures to reflect the patterns in which users click on visual elements and the underlying cognitive processes."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Observed distortions can be theoretically linked to the properties categorically inherent to LLMs: their statistical nature and encoding of semantic heuristics dependent on their training on linguistic data."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.