Distorted Perspectives of LLM-Simulated Preferences: Can AI Mislead Design?
Pith reviewed 2026-05-20 08:43 UTC · model grok-4.3
The pith
LLM simulations of design preferences diverge systematically from real user choices across multiple setups.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aggregated data from twenty-nine real preference tests (n = 2073) show significant and systematic discrepancies with LLM outputs; the mismatches remain stable when the simulation is varied in model reasoning, sampling strategy, persona type, or prompt specificity. LLM justifications replace genuine nuance with patterns such as emphasis on generic visual properties, attention to isolated elements, unnecessary elaboration, and overpraising.
What carries the argument
Holistic multimodal simulation of preference-test stimuli, with controlled manipulation of LLM variables (reasoning, sampling, persona, specificity) to quantify alignment against real-user aggregates.
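One way to picture this setup is as a factorial grid of LLM conditions, each scored against the human aggregate for a given test. The sketch below is illustrative only: the condition levels, the `condition_grid` and `total_variation` helpers, and the example shares are assumptions made for this page, not settings or metrics reported by the paper.

```python
from itertools import product

# Hypothetical levels for the four manipulated variables (names are illustrative).
CONDITIONS = {
    "reasoning": ["off", "on"],
    "sampling": ["greedy", "stochastic"],
    "persona": ["none", "demographic"],
    "specificity": ["generic", "detailed"],
}


def condition_grid(conditions):
    """Return every combination of variable levels as a list of dicts."""
    names = list(conditions)
    return [dict(zip(names, combo)) for combo in product(*conditions.values())]


def total_variation(human_shares, llm_shares):
    """Total variation distance between two choice distributions (0 = identical)."""
    options = set(human_shares) | set(llm_shares)
    return 0.5 * sum(
        abs(human_shares.get(o, 0.0) - llm_shares.get(o, 0.0)) for o in options
    )


grid = condition_grid(CONDITIONS)  # 2 x 2 x 2 x 2 combinations

# Made-up choice shares for one two-option preference test.
human = {"A": 0.62, "B": 0.38}
simulated = {"A": 0.35, "B": 0.65}
distance = total_variation(human, simulated)
```

Computing the distance once per condition and per test would show whether misalignment shrinks under any manipulation or stays roughly constant, which is the pattern the paper reports.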
If this is right
- Design teams that substitute LLM feedback for human testing risk creating interfaces that real users rate lower on preference measures.
- LLM-generated design critiques tend to lack the balanced, context-specific reasoning that human participants provide.
- Any automated pipeline that relies on current LLM preference simulation will inherit the same systematic biases observed here.
- Patterns such as overpraising and generic focus can be used as diagnostic signals to flag low-fidelity LLM outputs in design workflows.
Where Pith is reading between the lines
- Designers could treat LLM output as a low-cost first pass that still requires targeted human checks on the specific dimensions where mismatches are largest.
- The same simulation approach might be applied to other subjective judgments, such as content appeal or brand perception, to test whether similar distortions appear.
- If the root cause lies in training-data coverage of visual design judgments, targeted fine-tuning on large preference datasets could narrow the observed gaps.
Load-bearing premise
The aggregated preference data from the UXtweak platform accurately reflects unbiased user choices without platform-specific selection effects or test-format artifacts.
What would settle it
A new set of preference tests collected outside the original platform, using different recruitment and response formats, that produces LLM outputs closely matching the human distribution would undermine the claim of persistent discrepancies.
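A replication of this kind needs some rule for deciding whether simulated choices "closely match" the human distribution. As a minimal sketch, assuming a two-option test and a Pearson chi-square goodness-of-fit check (a choice made here for illustration; the paper's own statistics are not described on this page), the comparison could look like the following. All counts and shares are made up.

```python
def chi_square_gof(llm_counts, human_shares):
    """Pearson chi-square statistic of LLM choice counts against human shares."""
    total = sum(llm_counts.values())
    stat = 0.0
    for option, share in human_shares.items():
        expected = total * share          # count expected if the LLM matched humans
        observed = llm_counts.get(option, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Critical value of chi-square at alpha = 0.05 with 1 degree of freedom
# (two options give one degree of freedom).
CRITICAL_DF1 = 3.841

human_shares = {"A": 0.6, "B": 0.4}   # made-up human aggregate
llm_counts = {"A": 30, "B": 70}       # made-up simulated choices

stat = chi_square_gof(llm_counts, human_shares)
mismatch = stat > CRITICAL_DF1
```

A statistic below the threshold is only weak evidence of a match, so a real replication would likely pair it with an effect-size measure and an equivalence margin.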
Original abstract
Designers of digital solutions increasingly consult Large Language Models (LLMs) for their work. However, it remains unclear how this may affect the user experiences they produce and there are no established practices. We investigate how design preferences expressed by LLM-driven simulation methods align with those of real users. We present a study that aggregates real-world data and design stimuli from twenty-nine preference tests conducted in practice by users of the UXtweak online research platform (n = 2073). We perform holistic multimodal simulations where we manipulate LLM variables (model reasoning, sampling, persona type, and specificity) and assess their effects on algorithmic fidelity. Our results unveil significant and systematic discrepancies between peoples' real design preferences and LLM simulations that are consistent across manipulations. Synthetic justifications lack genuine depth, nuance and reasoning, which they substitute by patterns like focus on generic properties, specific elements, elaboration and overpraising. The unique attention directed by this research toward preferences within visual design stimuli highlights misrepresentation of perception and meaning by LLMs in a context that is intuitive yet critical for design teams. The external and ecological validity of our findings is high, given their replication across a multitude of real-world studies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript aggregates real-world design preference data from 29 UXtweak platform tests (n=2073) and compares them to LLM-driven multimodal simulations that systematically vary model reasoning, sampling, persona type, and specificity. It reports significant, manipulation-consistent discrepancies between real and simulated preferences, along with qualitative patterns in LLM justifications (focus on generic properties, attention to specific elements, elaboration, overpraising) that stand in for depth and nuance. The work emphasizes high ecological validity from cross-study replication and highlights risks for design teams using LLMs to simulate user perception of visual stimuli.
Significance. If the discrepancies are robust to alternative elicitation methods, the findings would caution against direct substitution of LLM simulations for real-user preference testing in visual design, particularly given the multi-study scale and explicit manipulation of LLM variables. The external grounding in independent platform data and the focus on algorithmic fidelity in an applied HCI context add practical value beyond purely synthetic evaluations.
major comments (1)
- Methods / Study Design: The central claim attributes observed discrepancies to LLM limitations after treating the aggregated UXtweak preference tests as an unbiased ground truth for 'peoples’ real design preferences.' No explicit controls, sensitivity analyses, or discussion address platform selection effects (self-selected digital-savvy participants) or test-format artifacts (forced-choice visual stimuli), leaving open the possibility that these factors contribute to or drive the reported misalignment rather than LLM behavior alone.
minor comments (2)
- Abstract: The claim of 'significant and systematic discrepancies' would benefit from a brief statement of the exact discrepancy metric (e.g., choice agreement rate, rank correlation) and any statistical controls applied across the 29 studies.
- Results: The description of post-hoc coding of justification patterns (generic properties, specific elements, elaboration, overpraising) should include inter-coder reliability or a reproducible coding scheme to support the qualitative claims.
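The two minor comments ask for quantities that are simple to state precisely. The sketch below shows one possible form of each, assuming a per-test "winning design" agreement rate and Cohen's kappa for two coders. Both are illustrative choices, the function names are hypothetical, and the data are made up; the manuscript's actual metric and coding procedure are not given here.

```python
from collections import Counter


def agreement_rate(human_winners, llm_winners):
    """Share of tests where the LLM picks the same winning design as humans."""
    matches = sum(h == m for h, m in zip(human_winners, llm_winners))
    return matches / len(human_winners)


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement from each coder's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up winners for four preference tests.
human_winners = ["A", "B", "A", "A"]
llm_winners = ["A", "A", "B", "A"]
rate = agreement_rate(human_winners, llm_winners)

# Made-up pattern codes assigned by two coders to four justifications.
coder_a = ["generic", "generic", "overpraise", "overpraise"]
coder_b = ["generic", "overpraise", "overpraise", "overpraise"]
kappa = cohens_kappa(coder_a, coder_b)
```

Reporting kappa alongside raw agreement matters because two coders who both favour one frequent code can agree often by chance alone.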
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which highlight important considerations for interpreting our real-world benchmark data. We address the major comment below and will incorporate revisions to clarify the scope of our findings.
Referee: The central claim attributes observed discrepancies to LLM limitations after treating the aggregated UXtweak preference tests as an unbiased ground truth for 'peoples’ real design preferences.' No explicit controls, sensitivity analyses, or discussion address platform selection effects (self-selected digital-savvy participants) or test-format artifacts (forced-choice visual stimuli), leaving open the possibility that these factors contribute to or drive the reported misalignment rather than LLM behavior alone.
Authors: We agree that the manuscript would benefit from greater explicitness on this point. The UXtweak data is presented as an ecologically valid aggregation of real design preference tests rather than a universally unbiased ground truth for all people's preferences. To address the referee's concern, we will add a dedicated 'Limitations' subsection to the Discussion that covers platform self-selection (e.g., digitally engaged participants) and forced-choice format effects as potential influences on the observed distributions. We will also note the consistency of discrepancies across the 29 independent studies as partial evidence of robustness, though we did not perform formal sensitivity analyses focused on these artifacts. We maintain that the core finding, systematic misalignment between LLM simulations and real aggregated preferences, remains informative for design practice even if the real data carries context-specific characteristics, but we will revise the text to avoid any implication of universal ground truth.
Revision: yes
Circularity Check
No circularity: empirical comparison to independent external platform data
The paper conducts an empirical study by aggregating real-world preference test data from 29 studies on the independent UXtweak platform (n=2073) and directly comparing it against LLM simulations under manipulated variables. No mathematical derivations, equations, fitted parameters, or self-citations are used to generate the central results; the discrepancies are measured against external user data rather than being constructed from the study's own inputs or prior author work. The analysis is therefore self-contained against external benchmarks with no reduction of outputs to inputs by definition or fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: aggregated preference data from multiple real-world tests can be treated as a reliable proxy for general user design preferences.