When Do LLM Personas Support Visualization Design? A Cross-Model Study of Color Assignment and Chart Choice

Klaus Mueller; Shahreen Salim

arxiv: 2607.02455 · v1 · pith:JZL76XMEnew · submitted 2026-07-02 · 💻 cs.HC


Pith reviewed 2026-07-03 05:44 UTC · model grok-4.3


keywords LLM personas, visualization design, personality effects, color assignment, chart choice, Big Five profiles, model comparison, human-computer interaction


The pith

LLM personas produce personality-linked color assignments only for certain models while chart choices depend mainly on task context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether Big Five personality profiles conditioned into large language models create stable effects during visualization design tasks. It finds that the link between personality traits and color choices for concepts is absent in one model version, fully present in another, and partial in a third. Abstract concepts show stronger personality influence on colors than concrete ones, while chart idiom rankings remain largely unchanged when personality prompts are removed. A sympathetic reader would care because the results clarify the conditions under which LLM personas can serve as useful early-stage stand-ins for diverse human users in visualization work.

Core claim

Personality-color coupling is highly model-configuration dependent: absent in GPT-4o-mini for all six concepts, consistent in GPT-4.1-mini across all six, and partial in GPT-5-mini for two of six. Concept type further shapes the signal: for abstract concepts, personality explains more hue variance than model identity, while concrete concepts show smaller and comparable effects. In chart choice, trait-aligned cluster aggregation produces stable top-idiom rankings across all nine cluster-context combinations, but a no-persona baseline recovers the same top choice in 8 of 9 model-context cells, indicating that task context drives rank-1 selection more than personality.
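The variance comparison in this claim can be illustrated with eta-squared, the share of total variance attributed to each factor. The Figure 2 caption describes a two-way ANOVA (model × persona); the sketch below is only a minimal balanced-design version of that idea, not the authors' analysis code. It treats hue as a plain number and ignores its circular nature, and the function name `eta_squared` and the toy data are invented for illustration.

```python
from statistics import mean


def eta_squared(hue):
    """Share of total variance attributed to each factor in a balanced design.

    hue[m][p] is a list of replicate hue values for model m and persona p.
    Assumes every cell has the same number of replicates and that the values
    are not all identical (otherwise the total sum of squares is zero).
    """
    n_models, n_personas, n_reps = len(hue), len(hue[0]), len(hue[0][0])
    values = [v for row in hue for cell in row for v in cell]
    grand = mean(values)
    ss_total = sum((v - grand) ** 2 for v in values)

    # Marginal means for each model and each persona.
    model_means = [mean([v for cell in row for v in cell]) for row in hue]
    persona_means = [
        mean([v for row in hue for v in row[p]]) for p in range(n_personas)
    ]
    ss_model = n_personas * n_reps * sum((m - grand) ** 2 for m in model_means)
    ss_persona = n_models * n_reps * sum((m - grand) ** 2 for m in persona_means)
    return {"model": ss_model / ss_total, "persona": ss_persona / ss_total}


# Toy data (invented): 2 models x 3 personas x 1 replicate, hue in degrees.
toy_hue = [
    [[0.0], [10.0], [20.0]],  # hypothetical model A
    [[1.0], [11.0], [21.0]],  # hypothetical model B
]
effects = eta_squared(toy_hue)
print(effects)
```

In the toy data the persona factor accounts for almost all of the variance, which is the pattern the claim reports for abstract concepts. Whatever the two shares do not cover belongs to the interaction and residual terms.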

What carries the argument

Big Five personality profiles conditioned into LLM prompts for color assignment and chart preference tasks.

If this is right

Multi-model testing is required before treating LLM persona outputs as reliable for visualization design.
Abstract concepts produce clearer personality signals in color choices than concrete concepts.
No-persona baselines must be included to isolate whether personality actually changes chart recommendations.
LLM personas function as exploratory probes rather than substitutes for human participants in visualization studies.
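The no-persona baseline check can be sketched as follows. The Figure 3 caption mentions Borda ranking, so the sketch aggregates rankings with a standard Borda count and then counts the cells in which the persona-conditioned top idiom matches the baseline top idiom. The cell keys, idiom names, and rankings are invented, and the tie-breaking rule is an assumption; the paper's actual aggregation details are not given on this page.

```python
from collections import defaultdict


def borda_top(rankings):
    """Return the idiom with the highest Borda score.

    rankings: list of rankings, each ordered best-first over the same idioms.
    An idiom at position k in a ranking of n items earns n - 1 - k points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, idiom in enumerate(ranking):
            scores[idiom] += n - 1 - position
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    return min(scores, key=lambda idiom: (-scores[idiom], idiom))


def baseline_agreement(persona_rankings, baseline_rankings):
    """Count cells where the persona top idiom equals the no-persona top idiom."""
    matches = sum(
        borda_top(persona_rankings[cell]) == borda_top(baseline_rankings[cell])
        for cell in persona_rankings
    )
    return matches, len(persona_rankings)


# Invented example: two (model, context) cells, three idioms.
persona = {
    ("model_a", "trend"): [
        ["line", "bar", "pie"],
        ["line", "pie", "bar"],
        ["bar", "line", "pie"],
    ],
    ("model_a", "part_to_whole"): [
        ["pie", "bar", "line"],
        ["pie", "bar", "line"],
    ],
}
baseline = {
    ("model_a", "trend"): [["line", "bar", "pie"]],
    ("model_a", "part_to_whole"): [["bar", "pie", "line"]],
}
agreement = baseline_agreement(persona, baseline)
print(agreement)
```

A high agreement count, such as the 8 of 9 cells reported in the paper, suggests that the task context alone already determines the top choice.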

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Design workflows could benefit from choosing specific model versions depending on whether personality variation is wanted in early color explorations.
The results raise the possibility that prompt refinements might reduce the observed model-to-model differences in personality effects.
If task context dominates chart selection, persona conditioning may add little value for idiom recommendation systems.

Load-bearing premise

The chosen Big Five profiles and task framings accurately capture personality effects relevant to visualization design without being artifacts of the specific prompt structures or model versions tested.

What would settle it

Re-running the color assignment tasks on GPT-4o-mini and observing consistent personality-color coupling across the six concepts would undermine the finding of high model dependence.
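Such a re-run needs an operational definition of "coupling". One common option is a Mantel-style permutation test, which asks whether personas that are close in trait space also receive similar colors. This page does not confirm that the paper uses this test, so the sketch below is only one plausible check. The toy scores are invented, hue is treated as a plain number, and both distance matrices are assumed to have non-constant entries.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def upper_triangle(dist, order):
    """Flatten the upper triangle of a distance matrix under a given item order."""
    n = len(order)
    return [dist[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    a = upper_triangle(dist_a, identity)
    r_obs = pearson(a, upper_triangle(dist_b, identity))
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the personas in the second matrix
        if pearson(a, upper_triangle(dist_b, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


def distance_matrix(values):
    """Pairwise absolute differences between scalar scores."""
    return [[abs(a - b) for b in values] for a in values]


# Invented toy data: six personas with one trait score and one hue value each.
traits = [0.0, 1.0, 2.0, 4.0, 8.0, 16.0]
hues = [10.0 + 10.0 * t for t in traits]  # perfectly coupled by construction
r_obs, p_value = mantel_test(distance_matrix(traits), distance_matrix(hues))
print(r_obs, p_value)
```

Running a test of this kind per concept and per model would show whether a given model yields coupling on all six concepts, on some, or on none.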

Figures

Figures reproduced from arXiv: 2607.02455 by Klaus Mueller, Shahreen Salim.

**Figure 2.** Variance decomposition of hue (H_LCH) from a two-way ANOVA (model × persona) pooled across all three models. Persona η² exceeds model η² for all abstract concepts

**Figure 3.** Cluster-based idiom preferences on GPT-4.1-mini. (a) Borda rank-1 idiom and within-context share per (cluster, context). (b) Borda …

**Figure 4.** Cross-model agreement on chart preference. (a) Top …

Original abstract

Large language model personas are increasingly used to approximate diverse users during early-stage visualization design, but it remains unclear whether persona-conditioned outputs reflect stable personality effects or artifacts of model choice and task framing. We examine this question across two visualization-relevant tasks: color assignment for abstract and concrete concepts, and chart-idiom preference ratings across task contexts. Using 43 Big Five profiles across GPT-4o-mini, GPT-4.1-mini, and GPT-5-mini, we find that personality-color coupling is highly model-configuration dependent: absent in GPT-4o-mini for all six concepts, consistent in GPT-4.1-mini across all six, and partial in GPT-5-mini for two of six. Concept type further shapes the signal: for abstract concepts, personality explains more hue variance than model identity, while concrete concepts show smaller and comparable effects. In chart choice, trait-aligned cluster aggregation produces stable top-idiom rankings across all nine cluster-context combinations, but a no-persona baseline recovers the same top choice in 8 of 9 model-context cells, indicating that task context drives rank-1 selection more than personality. These findings position LLM personas as exploratory probes for visualization design, not substitutes for human participants, and motivate multi-model testing, concept-type disaggregation, and no-persona baselines in future studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's core finding is that personality effects from LLM personas on color and chart tasks are model-dependent and frequently weaker than task context, shown through a three-model comparison with a no-persona baseline.


The main thing here is the model-dependence result: personality-color coupling shows up consistently in one GPT variant, not at all in another, and only partially in the third. In chart rankings, the no-persona baseline matches the top choice in most cells, which suggests task framing drives more of the output than the Big Five profiles do. That disaggregation by model and by abstract vs concrete concepts is the actual new piece.

What works is the inclusion of the baseline and the cross-model setup. It gives concrete evidence that single-model persona studies can be brittle, and it positions the personas as probes rather than replacements. The directional claims line up with the abstract without overclaiming universality.

The soft spot is the lack of any methods detail in what we have. No sample sizes, no stats, no prompt templates, no exclusion rules. Without those, it's impossible to tell whether the reported differences are stable or sensitive to small changes in setup. The three models are all close variants, so the dependence finding might not generalize beyond this family.

This is for HCI and visualization researchers who are already experimenting with LLM user simulation. It adds a useful cautionary data point but does not resolve the broader question of when these approximations are reliable. It deserves peer review because the empirical pattern is worth checking and extending, even if the current write-up needs the methods section filled in to stand on its own.

Referee Report

1 major / 0 minor

Summary. The paper claims that LLM personas derived from Big Five personality profiles produce highly model-dependent effects in two visualization design tasks (color assignment to abstract/concrete concepts and chart-idiom preference). Across GPT-4o-mini, GPT-4.1-mini, and GPT-5-mini with 43 profiles, personality-color coupling is absent in one model, consistent in another, and partial in the third; abstract concepts show stronger personality effects than concrete ones. In chart choice, trait-aligned clusters yield stable top rankings, yet a no-persona baseline matches the top choice in 8/9 cells, indicating task context dominates personality. The work positions LLM personas as exploratory probes rather than human substitutes and calls for multi-model testing and baselines.

Significance. If the empirical patterns hold after methodological clarification, the result is significant for HCI and visualization research because it supplies concrete evidence that persona effects are unstable across models and often weaker than task framing. This directly informs the growing practice of using LLMs to simulate user diversity in early design and supplies actionable guidance (no-persona baselines, concept-type disaggregation) that can be adopted immediately in follow-on studies.

major comments (1)

[Abstract] The directional claims (model-dependent color coupling, task-context dominance in chart choice) are presented without any information on experimental design, statistical tests, sample sizes per cell, prompt templates, or data-exclusion criteria. This absence is load-bearing because it prevents evaluation of whether the reported patterns are supported by the data.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the findings for HCI and visualization research. We address the single major comment below.


Referee: [Abstract] The directional claims (model-dependent color coupling, task-context dominance in chart choice) are presented without any information on experimental design, statistical tests, sample sizes per cell, prompt templates, or data-exclusion criteria. This absence is load-bearing because it prevents evaluation of whether the reported patterns are supported by the data.

Authors: We agree that the abstract omits these details, as is conventional for length-constrained abstracts (typically 150-250 words). The full manuscript provides them in Section 3 (Methods): the experimental design uses 43 Big Five personality profiles (derived from standard inventories) applied to three models (GPT-4o-mini, GPT-4.1-mini, GPT-5-mini) across two tasks; statistical tests include variance partitioning (personality vs. model identity) for color assignments and rank-stability analysis for chart choices; the sample size is 43 profiles per model-task cell with no data exclusion (all LLM outputs retained); prompt templates appear in Appendix A; and the no-persona baseline is explicitly described. The directional claims are directly supported by the quantitative results in Sections 4 and 5. We are willing to revise the abstract to include a one-sentence methods summary (e.g., "Across 43 Big Five profiles and three models...") if the editor prefers, though this would require trimming other content.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical observations only


This is an empirical comparison study that reports direct outputs from LLM queries across models, personas, and tasks. No equations, derivations, fitted parameters, or self-citation chains are used to support the central claims. All reported patterns (model-dependent color coupling, concept-type effects, baseline comparisons) are presented as observations from the experimental runs rather than reductions to prior inputs or definitions. The study is self-contained against its own data collection protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies on the domain assumption that Big Five traits can be meaningfully mapped to LLM conditioning for visualization tasks, with no free parameters or invented entities introduced.

axioms (1)

Domain assumption: the Big Five personality model provides a valid basis for constructing LLM personas that approximate human trait effects in design tasks.
Invoked by using 43 Big Five profiles to condition the models for color and chart tasks.


