arxiv: 2605.18311 · v1 · pith:XD4UTQYMnew · submitted 2026-05-18 · 💻 cs.HC

Distorted Perspectives of LLM-Simulated Preferences: Can AI Mislead Design?

Pith reviewed 2026-05-20 08:43 UTC · model grok-4.3

classification 💻 cs.HC
keywords LLM simulation · design preferences · user experience · preference testing · algorithmic fidelity · visual design · AI misalignment · synthetic user data

The pith

LLM simulations of design preferences diverge systematically from real user choices across multiple setups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can stand in for real people when designers want quick feedback on visual interfaces and layouts. It draws on thousands of actual preference tests run on a live research platform and runs parallel simulations while varying the model, its reasoning steps, sampling settings, assigned personas, and prompt detail. The comparisons reveal consistent gaps that do not disappear when the simulation parameters change. Human answers show specific reasoning and balanced critique; LLM answers default to generic observations, repetition of obvious traits, and excessive praise. Because many design teams already consult LLMs for early direction, these gaps could steer final products away from what users actually prefer.

Core claim

Aggregated data from twenty-nine real preference tests (n = 2073) show significant and systematic discrepancies with LLM outputs; the mismatches remain stable when the simulation is varied in reasoning depth, sampling strategy, persona framing, or prompt specificity. LLM justifications replace genuine nuance with recurring patterns: an emphasis on generic visual properties, attention to isolated elements, unnecessary elaboration, and overpraising.

What carries the argument

Holistic multimodal simulation of preference-test stimuli, with controlled manipulation of LLM variables (reasoning, sampling, persona, specificity) to quantify alignment against real-user aggregates.
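
As a rough illustration of what "quantifying alignment against real-user aggregates" can look like, the sketch below compares the share of votes each design option received from humans and from a simulation. The summary above does not state which metric the paper uses, so the total variation distance and the winner-agreement check are assumptions chosen for simplicity, and the vote counts are hypothetical.

```python
from collections import Counter


def choice_shares(choices, options):
    """Return the share of votes each option received."""
    counts = Counter(choices)
    total = len(choices)
    return {opt: counts.get(opt, 0) / total for opt in options}


def total_variation(p, q):
    """Total variation distance between two distributions over the same options."""
    return 0.5 * sum(abs(p[opt] - q[opt]) for opt in p)


def same_winner(p, q):
    """True when both distributions rank the same option first."""
    return max(p, key=p.get) == max(q, key=q.get)


# Hypothetical data for one two-option preference test.
options = ["A", "B"]
human_choices = ["A"] * 70 + ["B"] * 30
llm_choices = ["A"] * 20 + ["B"] * 80

human = choice_shares(human_choices, options)
llm = choice_shares(llm_choices, options)

print(total_variation(human, llm))  # 0 means identical shares, 1 means disjoint
print(same_winner(human, llm))      # whether the simulation picks the human winner
```

Repeating the same comparison for each simulation setting (model, sampling values, persona type, prompt specificity) would show whether any setting narrows the gap.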

If this is right

  • Design teams that substitute LLM feedback for human testing risk creating interfaces that real users rate lower on preference measures.
  • LLM-generated design critiques tend to lack the balanced, context-specific reasoning that human participants provide.
  • Any automated pipeline that relies on current LLM preference simulation will inherit the same systematic biases observed here.
  • Patterns such as overpraising and generic focus can be used as diagnostic signals to flag low-fidelity LLM outputs in design workflows.
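
The last point can be made concrete with a very small heuristic. The sketch below scores a justification by how much of it consists of praise words and generic design adjectives. The word lists, the threshold, and the function names are all hypothetical and are not taken from the paper; a real diagnostic would need validation against coded human and LLM justifications.

```python
import re

# Hypothetical word lists, for illustration only.
PRAISE_TERMS = {"great", "excellent", "beautiful", "perfect", "stunning", "amazing"}
GENERIC_TERMS = {"clean", "modern", "professional", "appealing", "simple"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def flag_justification(text, threshold=0.15):
    """Flag a justification whose praise plus generic-term rate meets the threshold."""
    tokens = tokenize(text)
    if not tokens:
        return {"praise_rate": 0.0, "generic_rate": 0.0, "flagged": False}
    praise_rate = sum(t in PRAISE_TERMS for t in tokens) / len(tokens)
    generic_rate = sum(t in GENERIC_TERMS for t in tokens) / len(tokens)
    return {
        "praise_rate": praise_rate,
        "generic_rate": generic_rate,
        "flagged": praise_rate + generic_rate >= threshold,
    }


print(flag_justification("The design is clean, modern and excellent."))
print(flag_justification("The checkout button is hidden below the fold, so I missed it."))
```

A keyword count like this only catches surface patterns; it says nothing about whether the reasoning is specific to the stimulus, which is the deeper gap the paper describes.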

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could treat LLM output as a low-cost first pass that still requires targeted human checks on the specific dimensions where mismatches are largest.
  • The same simulation approach might be applied to other subjective judgments, such as content appeal or brand perception, to test whether similar distortions appear.
  • If the root cause lies in training-data coverage of visual design judgments, targeted fine-tuning on large preference datasets could narrow the observed gaps.

Load-bearing premise

The aggregated preference data from the UXtweak platform accurately reflects unbiased user choices without platform-specific selection effects or test-format artifacts.

What would settle it

A new set of preference tests collected outside the original platform, using different recruitment and response formats, that produces LLM outputs closely matching the human distribution would undermine the claim of persistent discrepancies.
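
One way to operationalize "closely matching the human distribution" for a single two-option test is a goodness-of-fit check of the simulated choice counts against the human choice shares. The sketch below is an assumption about how such a check could be run, not the paper's procedure. It handles only the two-option case (one degree of freedom), assumes both human shares are greater than zero, and uses hypothetical counts.

```python
import math


def chi_square_two_options(llm_counts, human_shares):
    """Chi-square goodness of fit of LLM choice counts to human shares (df = 1).

    llm_counts: observed LLM votes per option, e.g. [20, 80]
    human_shares: human vote shares per option, summing to 1, all > 0
    """
    n = sum(llm_counts)
    stat = 0.0
    for observed, share in zip(llm_counts, human_shares):
        expected = n * share
        stat += (observed - expected) ** 2 / expected
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical mismatch: humans prefer A (70/30), the simulation prefers B (20/80).
print(chi_square_two_options([20, 80], [0.7, 0.3]))

# Hypothetical close match: the simulation is within one vote of the human shares.
print(chi_square_two_options([69, 31], [0.7, 0.3]))
```

A small p-value indicates the simulated counts are unlikely under the human shares; across many tests, a correction for multiple comparisons would also be needed.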

Figures

Figures reproduced from arXiv: 2605.18311 by Eduard Kuric, Matus Krajcovic, Peter Demcak.

Figure 1: Research model. H2b. Temperature does not affect the similarity between LLM-synthesized and audience design preferences. H2c. Top-p does not affect the similarity between LLM-synthesized and audience design preferences. Works simulating participants have imposed various persona representations to prime models toward better alignment with audiences (Gerosa et al., 2024). Personas can represent individuals (…
Figure 2: Preference test LLM simulation procedure. Our ensemble of hypotheses demanded that the simulations be performed iteratively with different settings. We used GPT 4.1 as the baseline model to assess LLM-generated design preferences and their justifications, with a parameter configuration intended to improve its algorithmic fidelity. Mega-personas and recommended values of temperature and top_p = 1 were used …
Figure 3: Open-ended justification measures (a-d) and linguistic similarity of simulations to real justifications (e, f). argument, even as they were less likely to be relevant. In copywriting alternatives communicating the same message differently, LLMs failed to capture nuanced subjective reasons that caused some options to resonate with people more strongly. Inconsistency with human justifications also translated …
Original abstract

Designers of digital solutions increasingly consult Large Language Models (LLMs) for their work. However, it remains unclear how this may affect the user experiences they produce and there are no established practices. We investigate how design preferences expressed by LLM-driven simulation methods align with those of real users. We present a study that aggregates real-world data and design stimuli from twenty-nine preference tests conducted in practice by users of the UXtweak online research platform (n = 2073). We perform holistic multimodal simulations where we manipulate LLM variables (model reasoning, sampling, persona type, and specificity) and assess their effects on algorithmic fidelity. Our results unveil significant and systematic discrepancies between peoples' real design preferences and LLM simulations that are consistent across manipulations. Synthetic justifications lack genuine depth, nuance and reasoning, which they substitute by patterns like focus on generic properties, specific elements, elaboration and overpraising. The unique attention directed by this research toward preferences within visual design stimuli highlights misrepresentation of perception and meaning by LLMs in a context that is intuitive yet critical for design teams. The external and ecological validity of our findings is high, given their replication across a multitude of real-world studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript aggregates real-world design preference data from 29 UXtweak platform tests (n=2073) and compares them to LLM-driven multimodal simulations that systematically vary model reasoning, sampling, persona type, and specificity. It reports significant, manipulation-consistent discrepancies between real and simulated preferences, along with qualitative patterns in LLM justifications (generic focus, element-specific elaboration, overpraising) that lack depth or nuance. The work emphasizes high ecological validity from cross-study replication and highlights risks for design teams using LLMs to simulate user perception of visual stimuli.

Significance. If the discrepancies are robust to alternative elicitation methods, the findings would caution against direct substitution of LLM simulations for real-user preference testing in visual design, particularly given the multi-study scale and explicit manipulation of LLM variables. The external grounding in independent platform data and the focus on algorithmic fidelity in an applied HCI context add practical value beyond purely synthetic evaluations.

major comments (1)
  1. Methods / Study Design: The central claim attributes observed discrepancies to LLM limitations after treating the aggregated UXtweak preference tests as an unbiased ground truth for 'peoples’ real design preferences.' No explicit controls, sensitivity analyses, or discussion address platform selection effects (self-selected digital-savvy participants) or test-format artifacts (forced-choice visual stimuli), leaving open the possibility that these factors contribute to or drive the reported misalignment rather than LLM behavior alone.
minor comments (2)
  1. Abstract: The claim of 'significant and systematic discrepancies' would benefit from a brief statement of the exact discrepancy metric (e.g., choice agreement rate, rank correlation) and any statistical controls applied across the 29 studies.
  2. Results: The description of post-hoc coding of justification patterns (generic properties, specific elements, elaboration, overpraising) should include inter-coder reliability or a reproducible coding scheme to support the qualitative claims.
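
For readers unfamiliar with the inter-coder reliability requested in the second minor comment, Cohen's kappa is one common measure for two coders. The sketch below is a generic implementation with hypothetical labels; nothing here indicates that the authors used this statistic or these category names.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two coders to four justifications.
coder_1 = ["generic", "generic", "overpraise", "elaboration"]
coder_2 = ["generic", "overpraise", "overpraise", "elaboration"]
print(cohens_kappa(coder_1, coder_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one category (for example, generic focus) dominates the coded sample.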

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments, which highlight important considerations for interpreting our real-world benchmark data. We address the major comment below and will incorporate revisions to clarify the scope of our findings.

point-by-point responses
  1. Referee: The central claim attributes observed discrepancies to LLM limitations after treating the aggregated UXtweak preference tests as an unbiased ground truth for 'peoples’ real design preferences.' No explicit controls, sensitivity analyses, or discussion address platform selection effects (self-selected digital-savvy participants) or test-format artifacts (forced-choice visual stimuli), leaving open the possibility that these factors contribute to or drive the reported misalignment rather than LLM behavior alone.

    Authors: We agree that the manuscript would benefit from greater explicitness on this point. The UXtweak data is presented as an ecologically valid aggregation of real design preference tests rather than a universally unbiased ground truth for all people's preferences. To address the referee's concern, we will add a dedicated 'Limitations' subsection in the Discussion that discusses platform self-selection (e.g., digitally engaged participants) and forced-choice format effects as potential influences on the observed distributions. We will also note the consistency of discrepancies across the 29 independent studies as partial evidence of robustness, though we did not perform formal sensitivity analyses focused on these artifacts. We maintain that the core finding, systematic misalignment between LLM simulations and real aggregated preferences, remains informative for design practice even if the real data carries context-specific characteristics, but we will revise the text to avoid any implication of universal ground truth.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison to independent external platform data

full rationale

The paper conducts an empirical study by aggregating real-world preference test data from 29 studies on the independent UXtweak platform (n=2073) and directly comparing it against LLM simulations under manipulated variables. No mathematical derivations, equations, fitted parameters, or self-citations are used to generate the central results; the discrepancies are measured against external user data rather than being constructed from the study's own inputs or prior author work. The analysis is therefore self-contained against external benchmarks with no reduction of outputs to inputs by definition or fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard statistical comparison methods and the representativeness of platform-collected user data without introducing new free parameters, axioms beyond basic statistical assumptions, or invented entities.

axioms (1)
  • domain assumption: Aggregated preference data from multiple real-world tests can be treated as a reliable proxy for general user design preferences.
    Invoked when claiming high external validity and systematic discrepancies.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages

  1. [1] Yoon, Se-eun; He, Zhankui; Echterhoff, Jessica; McAuley, Julian. Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024.

  2. [2] Park, Peter S.; Schoenegger, Philipp; Zhu, Chongyang. [title missing]. Behavior Research Methods.

  3. [3] Sparrenberg, Lorenz; Schneider, Tobias; Deußer, Tobias; Koppenborg, Markus; Sifa, Rafet. Correcting Systematic Bias in LLM-Generated Dialogues Using Big Five Personality Traits.

  4. [4] Do Personality Tests Generalize to Large Language Models? Socially Responsible Language Modelling Research.

  5. [5] Jia, Feiran; Ye, Ziyu; Lai, Shiyang; Shu, Kai; Gu, Jindong; Bibi, Adel; Hu, Ziniu; Jurgens, David; Evans, James; Torr, Philip H.S.; Ghanem, Bernard; Li, Guohao; Xie, Chengxing; Chen, Canyu. Can Large Language Model Agents Simulate Human Trust Behavior?

  6. [6] Qu, Yao; Wang, Jue. [title missing]. Humanities and Social Sciences Communications.

  7. [7] Whose Opinions Do Language Models Reflect? Proceedings of the 40th International Conference on Machine Learning. 2023.

  8. [8] Toward accurate psychological simulations: Investigating LLMs’ responses to personality and cultural variables. 2025.

  9. [9] Giorgi, Salvatore; Liu, Tingting; Aich, Ankit; Isman, Kelsey Jane; Sherman, Garrick; Fried, Zachary; Sedoc, João; Ungar, Lyle; Curtis, Brenda. Modeling Human Subjectivity in LLMs Using Explicit and Implicit Human Factors in Personas. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024.

  10. [10] Gerosa, Marco; Trinkenreich, Bianca; Steinmacher, Igor; Sarma, Anita. [title missing]. Automated Software Engineering.

  11. [11] Fleeson, William; Gallagher, Patrick. The implications of Big Five standing for the distribution of trait manifestation in behavior: fifteen experience-sampling studies and a meta-analysis. J Pers Soc Psychol.

  12. [12] Kovač, Grgur; Portelas, Rémy; Sawayama, Masataka; Dominey, Peter Ford; Oudeyer, Pierre-Yves. Stick to your role! Stability of personal values expressed in large language models.

  13. [13] Xiao, Ziang; Zhou, Michelle X.; Liao, Q. Vera; Mark, Gloria; Chi, Changyan; Chen, Wenxi; Yang, Huahai. [title missing]. ACM Trans. Comput.-Hum. Interact. June 2020.

  14. [14] Jan Karem Höhne; Konstantin Gavras; Joshua Claassen. [title missing]. Social Science Computer Review. 2024.

  15. [15] Zhu, Zimeng; Hsu, Carol; Nah, Fiona Fui-Hoon; Liu, Na. [title missing]. Internet Research. 2026.

  16. [16] Baughan, Amanda; August, Tal; Yamashita, Naomi; Reinecke, Katharina. [title missing]. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 2020.

  17. [17] Stock, Ruth Maria; Oliveira, Pedro; von Hippel, Eric. [title missing]. Journal of Product Innovation Management.

  18. [18] Selena Russo; Chiara Jongerius; Flavia Faccio; Silvia F.M. Pizzoli; Cathy Anne Pinto; Jorien Veldwijk; Rosanne Janssens; Gwenda Simons; Marie Falahee; Esther [author list truncated]. Understanding Patients' Preferences: A Systematic Review of Psychological Instruments Used in Patients' Preference and Decision Studies. 2019.

  19. [19] Lee, Sangwon; Koubek, Richard J. [title missing]. Interacting with Computers. 2010.

  20. [20] Pfefferle, Dana; Talbot, Steven R.; Kahnau, Pia; Cassidy, Lauren C.; Brockhausen, Ralf R.; Jaap, Anne; Deikun, Veronika; Yurt, Pinar; Gail, Alexander; Treue, Stefan; Lewejohann, Lars. [title missing]. Behavior Research Methods.

  21. [21] Tomlin, W. Craig. UX and Usability Testing Data. UX Optimization: Combining Behavioral UX and Usability Testing Data to Optimize Websites. 2018.

  22. [22] Lazik, Christopher Klaus; Katins, Christopher; Kauter, Charlotte; Jakob, Jonas; Jay, Caroline; Grunske, Lars; Kosch, Thomas. [title missing]. Proceedings of the Mensch Und Computer 2025. 2025.

  23. [23] Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis. 2023.

  24. [24] H… [author list garbled]. Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.

  25. [25] Monika Imschloss; Marko Sarstedt; Susanne J. Adler; Jun Hwa Cheah. [title missing]. The Service Industries Journal. 2025.

  26. [26] Shanahan, Murray; McDonell, Kyle; Reynolds, Laria. [title missing]. Nature.

  27. [27] Zhang, Taiyu; Zhang, Xuesong; Cools, Robbe; Simeone, Adalberto. [title missing]. Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents. 2024.

  28. [28] Lumintu, Ida. Content-Based Recommendation Engine Using Term Frequency-Inverse Document Frequency Vectorization and Cosine Similarity: A Case Study.

  29. [29] Hui, Xiang; Reshef, Oren; Zhou, Luofeng. [title missing]. Organization Science. 2024.

  30. [30] Niederhoffer, Kate; Kellerman, Gabriella Rosen; Lee, Angela; Liebscher, Alex; Rapuano, Kristina; Hancock, Jeffrey T. [title missing]. 2025.

  31. [31] Dietrich, Franz; List, Christian. [title missing]. Noûs.

  32. [32] Trust and reliance on AI: An experimental study on the extent and costs of overreliance on AI. 2024.

  33. [33] AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises. 2026.

  34. [34] Towards Measuring the Representation of Subjective Global Opinions in Language Models. 2024.

  35. [35] Using LLMs for market research. Harvard Business School Marketing Unit Working Paper. 2023.

  36. [36] Do large language models produce diverse design concepts? A comparative study with human-crowdsourced solutions. Journal of Computing and Information Science in Engineering. 2025.

  37. [37] Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina. 2025.

  38. [38] A Tale of Two Identities: An Ethical Audit of Human and AI-Crafted Personas. 2025.

  39. [39] The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. 2025.

  40. [40] Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis. 2024.

  41. [41] Introducing [title truncated]. 2025.

  42. [42] Models - [title truncated]. 2024.

  43. [43] Ma, Weicheng; Deng, Chunyuan; Moossavi, Aram; Wang, Lili; Vosoughi, Soroush; Yang, Diyi. Simulated Misinformation Susceptibility (SMISTS): Enhancing Misinformation Research with Large Language Model Simulations. Findings of the Association for Computational Linguistics: ACL 2024. 2024.

  44. [44] Takaffoli, Macy; Li, Sijia; M… [author list garbled]. Generative AI in User Experience Design and Research: How Do UX Practitioners, Teams, and Companies Use GenAI in Industry? Proceedings of the 2024 ACM Designing Interactive Systems Conference.

  45. [45] Schuller, Andreas; Janssen, Doris; Blumenr… [author list garbled]. Generating personas using LLMs and assessing their viability. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems.

  46. [46] Zhu, Qihao; Chong, Leah; Yang, Maria; Luo, Jianxi. [title missing]. Journal of Mechanical Design. 2025.

  47. [47] Jingoog Kim; Mary Lou Maher. [title missing]. International Journal of Design Creativity and Innovation. 2023.

  48. [48] Duan, Peitong; Cheng, Chin-Yi; Li, Gang; Hartmann, Bjoern; Li, Yang. [title missing]. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 2024.

  49. [49] Petridis, Savvas; Terry, Michael; Cai, Carrie Jun. [title missing]. Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.

  50. [50] The preference effect in design concept evaluation. 2014.

  51. [51] Preferences: neither behavioural nor mental. Economics and Philosophy. 2019.

  52. [52] Is usability testing valid with prototypes where clickable hotspots are highlighted upon misclick? 2025.

  53. [53] Validation of information architecture: Cross-methodological comparison of tree testing variants and prototype user testing. 2025.

  54. [54] Marcel Binz; Eric Schulz. [title missing]. Proceedings of the National Academy of Sciences. 2023.

  55. [55] Sop, Serhat Adem; Kurçer, Doğa. [title missing]. Journal of Hospitality and Tourism Technology. 2024.

  56. [56] Democratizing eye-tracking? Appearance-based gaze estimation with improved attention branch. 2025.

  57. [57] Can behavioral features reveal lying in an online personality questionnaire? The impact of mouse dynamics and speech. 2025.