pith. machine review for the scientific record.

arxiv: 2605.13307 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.HC

Recognition: 2 theorem links · Lean Theorem

PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 20:29 UTC · model grok-4.3

classification 💻 cs.CL cs.HC
keywords personalization · preference fine-tuning · P-DPO · human evaluation · simulated users · conversational AI · language models
0 comments

The pith

Preference fine-tuning on pooled data nearly matches individual personalization and beats prompting in real-user tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how to personalize conversational AI by re-recruiting 530 people two years after they gave preferences and letting them chat blindly with different model versions. It finds that fine-tuning a model on preference data works better than either a generic model or one that receives the same data only through prompts. Training on a single pooled set of diverse preferences delivers almost the same benefit as tailoring a separate model to each person. Fine-tuning also increases sycophantic and overly agreeable replies that users reward in short conversations but that could erode trust over time. When the same setup is run with simulated users instead of real people, the simulators pick different topics, show stronger biases, and fail to match how consistent humans are with their own earlier judgments.

Core claim

Preference fine-tuning (P-DPO) significantly outperforms both a generic model and personalised prompting, yet adapting to individual preference data yields only marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may carry deleterious long-term consequences. Replicating the within-subject experiment with simulated users recovers the aggregate model hierarchy, but the simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from those of humans.

What carries the argument

P-DPO preference fine-tuning applied to individual versus pooled data from the PRISM dataset, evaluated via blinded multi-turn conversations with re-recruited real users and with simulated users.
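For orientation, the fine-tuned conditions rest on the standard DPO objective; a minimal sketch in notation of our own choosing (the per-user conditioning follows the PPFT description in Figure 2, and the exact parameterisation shown here is an assumption, not a quotation from the paper):

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Pooled training optimises this loss once over the union of all users' preference pairs, \mathcal{D} = \bigcup_u \mathcal{D}_u, with a single policy; the individual variant conditions the policy on a learned user embedding e_u = f_P(u), i.e. \pi_\theta(y \mid x, e_u), or fits per-user adaptations. The marginal-gains claim is that these two settings end up nearly indistinguishable in blinded human evaluation.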

Load-bearing premise

Preferences collected two years earlier remain stable enough to serve as ground truth for current personalisation, and blinded multi-turn conversations isolate the effects of fine-tuning versus prompting without confounding factors such as conversation length or topic selection.

What would settle it

A new experiment that collects fresh preferences from the same users immediately before the conversations and finds substantially larger gains for individual fine-tuning over pooled training would falsify the marginal-gains claim.

Figures

Figures reproduced from arXiv: 2605.13307 by Bertie Vidgen, Christopher Summerfield, Fanzhi Zeng, Hannah Rose Kirk, Henry Davidson, Liu Leqi, Scott A. Hale.

Figure 1
Figure 1. Study overview. (A) The Prism dataset (Kirk et al., 2024): 1,500 participants from 75 countries completed surveys detailing sociodemographics, AI usage, and stated preferences for AI behaviour, then conversed with “off-the-shelf” LLMs. In each conversation, participants rated four model responses to their prompt, then continued with the highest-rated model, yielding 8,011 conversations and 68,371 ratings l… view at source ↗
Figure 2
Figure 2. Personalised Preference Fine-Tuning (PPFT). PPFT follows the Personalised RLHF framework of Li et al. (2024b), where a personalised LLM consists of a learnable user model f_P and a base model π_SFT. The user model maps each user identifier to a user embedding, which is prepended to the input embeddings via soft prompting for personalised generation. The user model and the LLM are trained jointly on the pers… view at source ↗
Figure 3
Figure 3. Temporal trends in LLM usage. Individual-level transitions in frequency of LLM use between Prism (2023) and Prism-X (2025), demonstrating an upward migration toward more frequent use. Flow weight represents the N of participants in each transition, coloured by the 2023 category. We subset to the panel of returning participants, so bars indicate sample proportions across categories at each timepoint (both N… view at source ↗
Figure 4
Figure 4. Human preferences for models across ordinal and cardinal measures. Panel A1: Distribution of ordinal preferences showing the proportion of trials (95% CI) where each model was selected as the most preferred after the opening turn, versus the full conversation; dashed line is 25% chance level. Panel A2: Conditional logit odds ratios (vs Base) for all ordinal measures, with 95% CIs. Bracket shows significant… view at source ↗
Figure 5
Figure 5. Robustness of preference ordering across elicitation methods, willingness to pay and behavioural signals. Panel A: Agreement heatmap across elicitation methods, showing the percentage of participant-domain observations where two methods agree on the top model. Panel B1: WTP distribution by model. Panel B2: Linear mixed-effects coefficients for WTP amount (in USD). Panel C1: Attention capture (spoken-to-fir… view at source ↗
Figure 6
Figure 6. The effect of preference fine-tuning on model behaviour. Panel A: Mean response length (characters) by model; fine-tuned models produce substantially longer responses. Panel B: Length bias attenuation for model coefficients on preference ratings with and without response length as a covariate. Panel C: Model responses illustrating high vs low scoring first turns per trait, drawn from the same user prompt w… view at source ↗
Figure 7
Figure 7. Attitudes towards personalisation methods. Panel A1: Distribution of acceptability ratings (0–100) for active vs passive personalisation methods. Panel B1: Mean acceptability by specific method, coloured by attitude dimension (usefulness, autonomy, comfort) and classified by active (pink) vs passive (purple). Panel A2: Passive vs active effect from pooled linear mixed-effects models with attitude interact… view at source ↗
Figure 8
Figure 8. Accuracy of simulated judgements relative to a human self-consistency baseline. Panel A1: Distribution of per-trial Kendall’s τ between simulated and human rankings for Sim-Judgement, Sim-Dynamic, and Human Self-Consistency conditions. Panel A2: Mean Kendall’s τ (τ̄) with 95% bootstrap CIs. Panel B: Top-k ranking accuracy versus random and human self-consistency baselines, with 95% bootstrapped CIs. huma… view at source ↗
Figure 9
Figure 9. Preference biases across human and simulated evaluators. Panel A1: Mean response length and score (4 traits) by preference rank, shown separately for human and Sim-Dynamic conditions; upward slopes indicate traits associated with higher preference. Panel A2: Odds ratios from a pooled conditional logit of ranked-best on all four trait scores and response length (per 100 characters) with source interactions … view at source ↗
Figure 10
Figure 10. Distribution of topics across simulated and human conversations. Panel A1: UMAP projection of first-user-turn embeddings with HDBSCAN topic clusters, annotated with GPT-generated cluster labels. Panel A2: Proportion chart shows source composition (% human per cluster). Panel B: Across-user pairwise cosine distance distributions by domain, comparing human and simulated dynamic conversations. KS test D stat… view at source ↗
Figure 11
Figure 11. Differences in interaction style across human and simulated users. Panel A: Example user messages from the same conversation, comparing a human participant’s text with the corresponding simulated text. Panel B: Human vs Sim-Dynamic distributions and means for each of six user-side autograded dimensions, with ∗∗∗p < .001 from Wilcoxon rank-sum test of differences… view at source ↗
Figure 12
Figure 12. Conversational dynamics between users and models. Panel A: Per-turn score trajectories by source (Human vs Sim-Dynamic) by trait dimension (sycophancy, relationship seeking) and role (user, model). Panel B: Cross-lagged regression coefficients for model-to-user and user-to-model paths, by dimension and source. Positive coefficients indicate that one party’s behaviour at turn t − 1 predicts escalation in… view at source ↗
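The simulator-accuracy comparison in Figure 8 hinges on per-trial Kendall’s τ between a simulated ranking and the human ranking of the same four responses, summarised as a mean with bootstrap CIs. A minimal sketch of that statistic (invented example rankings and our own variable names, not the authors’ code):

    import numpy as np
    from scipy.stats import kendalltau

    # One list per trial: ranks assigned to the same four model responses, in a fixed order.
    # The rankings below are invented for illustration.
    human_ranks = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]
    sim_ranks   = [[1, 3, 2, 4], [2, 1, 3, 4], [3, 4, 2, 1]]

    # Per-trial rank correlation between simulator and human.
    taus = np.array([kendalltau(h, s)[0] for h, s in zip(human_ranks, sim_ranks)])

    # Mean tau with a 95% bootstrap CI over trials.
    rng = np.random.default_rng(0)
    boot = [rng.choice(taus, size=len(taus), replace=True).mean() for _ in range(10_000)]
    ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])
    print(f"mean tau = {taus.mean():.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")

The same machinery, applied to two rankings produced by the same human at different times, gives the self-consistency baseline that the simulators are compared against.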
read the original abstract

Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports results from a large-scale within-subject experiment re-recruiting 530 participants from 52 countries who originally contributed to the PRISM preference dataset two years earlier. In blinded multi-turn conversations, it compares a generic model, personalized prompting, and preference fine-tuning via P-DPO, claiming that P-DPO significantly outperforms the other two approaches while individual-level adaptation yields only marginal gains over training on pooled preferences from the diverse population. The work further shows that fine-tuning amplifies sycophancy and relationship-seeking behaviors, and that replicating the protocol with simulated users recovers aggregate model rankings but produces lower self-consistency, different topic distributions, amplified position biases, and divergent feedback dynamics compared to humans.

Significance. If the empirical comparisons hold after addressing the noted limitations, the study provides valuable large-scale human evidence on the practical limits of personalization in conversational models. The result that pooled training nearly matches individual fine-tuning has direct implications for scalable deployment, while the identification of amplified undesirable behaviors and the human-simulator gaps offer concrete benchmarks and cautions for future work in preference alignment and simulation-based evaluation.

major comments (3)
  1. [Methods] Methods (preference stability): The central claim that individual P-DPO yields only marginal gains over pooled training depends on treating the 2023 PRISM preferences as stable ground truth for 2025 judgments. No stability checks, re-elicitation of original preference pairs, or item-level agreement statistics between waves are described, leaving open the possibility that preference drift confounds the personalization results.
  2. [Results] Results (statistical reporting): The abstract and results sections claim that P-DPO 'significantly outperforms' both baselines and that individual adaptation yields 'marginal gains,' yet no effect sizes, p-values, confidence intervals, or details on exclusion criteria and error bars are supplied. This weakens verifiability of the within-subject comparisons and the aggregate hierarchy claims.
  3. [Experimental Design] Experimental design (confounds): The blinded multi-turn setup is a strength, but the paper does not report controls or analyses for potential confounds such as systematic differences in conversation length, topic selection, or position biases across conditions. These factors could affect isolation of fine-tuning effects from prompting or generic baselines.
minor comments (2)
  1. [Abstract] The abstract would benefit from including at least one key quantitative detail (e.g., effect size or consistency metric) to support the headline claims.
  2. [Discussion] Discussion of long-term risks from amplified sycophancy would be strengthened by explicit links to existing alignment literature on short-term reward hacking.
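As an illustration of the reporting that major comment 2 asks for, a paired within-subject comparison reduces to an effect size plus an interval; a minimal sketch with invented numbers and our own variable names (not the paper's analysis code):

    import numpy as np
    from scipy.stats import wilcoxon

    # Per-participant mean ratings under two conditions, paired within participant.
    # Values are invented for illustration.
    pdpo_rating   = np.array([4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.7, 4.4])
    prompt_rating = np.array([3.7, 3.9, 4.0, 3.5, 3.8, 3.6, 3.5, 4.1])

    diff = pdpo_rating - prompt_rating
    stat, p = wilcoxon(pdpo_rating, prompt_rating)   # paired non-parametric test
    d = diff.mean() / diff.std(ddof=1)               # standardised paired difference

    # 95% bootstrap CI for the mean paired difference.
    rng = np.random.default_rng(0)
    boot = [rng.choice(diff, size=len(diff), replace=True).mean() for _ in range(10_000)]
    ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])
    print(f"Wilcoxon p = {p:.3f}, paired d = {d:.2f}, mean diff 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")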

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below, proposing revisions to improve the clarity and robustness of our findings.

read point-by-point responses
  1. Referee: [Methods] Methods (preference stability): The central claim that individual P-DPO yields only marginal gains over pooled training depends on treating the 2023 PRISM preferences as stable ground truth for 2025 judgments. No stability checks, re-elicitation of original preference pairs, or item-level agreement statistics between waves are described, leaving open the possibility that preference drift confounds the personalization results.

    Authors: We agree that preference stability is a key assumption. The experiment evaluates whether preferences elicited in 2023 can be used to personalize models for the same users' interactions in 2025. While we did not re-elicit the original pairs due to the two-year gap and study design, we will add a section discussing potential preference drift as a limitation and its implications for the marginal gains observed. If feasible with the available data, we will report agreement metrics between waves. revision: partial

  2. Referee: [Results] Results (statistical reporting): The abstract and results sections claim that P-DPO 'significantly outperforms' both baselines and that individual adaptation yields 'marginal gains,' yet no effect sizes, p-values, confidence intervals, or details on exclusion criteria and error bars are supplied. This weakens verifiability of the within-subject comparisons and the aggregate hierarchy claims.

    Authors: We thank the referee for pointing this out. In the revision, we will include detailed statistical reporting: effect sizes, p-values from paired t-tests or appropriate non-parametric tests, 95% confidence intervals, clarification on participant exclusion criteria (e.g., based on attention checks or incomplete data), and specification of how error bars are calculated (e.g., standard error of the mean). revision: yes

  3. Referee: [Experimental Design] Experimental design (confounds): The blinded multi-turn setup is a strength, but the paper does not report controls or analyses for potential confounds such as systematic differences in conversation length, topic selection, or position biases across conditions. These factors could affect isolation of fine-tuning effects from prompting or generic baselines.

    Authors: We acknowledge the need to address potential confounds. We will add analyses in the revised manuscript: (1) report and control for conversation lengths across conditions (e.g., via regression or length-matched subsets), (2) compare topic distributions using the available annotations, and (3) analyze position biases (e.g., preference for first or last response) and their variation by condition. These additions will strengthen the isolation of fine-tuning effects. revision: yes
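The length control promised in response 3 could take roughly this shape; a minimal sketch with statsmodels and hypothetical column names (mirroring the attenuation check described for Figure 6B, not the authors' actual analysis code):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical layout: one row per rated response, with the participant's rating,
    # the condition (Base / Prompted / Pooled P-DPO / Individual P-DPO), the response
    # length in characters, and a participant identifier.
    df = pd.read_csv("ratings.csv")
    df["length_100"] = df["response_chars"] / 100.0

    # Condition effects with participant random intercepts, without length adjustment.
    m0 = smf.mixedlm("rating ~ C(condition)", df, groups=df["participant_id"]).fit()

    # The same model with response length as a covariate; comparing condition
    # coefficients across m0 and m1 shows how much of the fine-tuning advantage
    # is carried by longer responses.
    m1 = smf.mixedlm("rating ~ C(condition) + length_100", df, groups=df["participant_id"]).fit()

    print(m0.summary())
    print(m1.summary())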

Circularity Check

0 steps flagged

No circularity: purely empirical within-subject comparisons

full rationale

The paper reports a large-scale re-recruitment experiment that directly measures human judgments of model outputs in blinded conversations. No equations, parameter fits, or derivations are present that reduce any reported result to its own inputs by construction. The central claims rest on observed performance differences against fresh human evaluations rather than on any self-referential modeling step. Self-citations (e.g., to the original PRISM dataset) supply the participant pool but do not function as load-bearing uniqueness theorems or ansatzes that force the outcome. The stability of preferences over two years is an empirical assumption open to falsification, not a definitional or fitted circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on two domain assumptions about preference stability and the validity of conversation-based preference measurement; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Preferences elicited two years earlier remain sufficiently stable to evaluate current personalisation methods
    Re-recruited participants from the PRISM dataset after two years
  • domain assumption: Blinded multi-turn conversations provide unbiased measures of user preference
    Used to compare personalised and non-personalised models

pith-pipeline@v0.9.0 · 5547 in / 1368 out tokens · 52364 ms · 2026-05-14T20:29:51.648514+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    URL https://support.claude.com/en/articles/10185728-understanding-claude-s-personalization-features. N. O. Attah. AI models might be drawn to ‘spiritual bliss’. Then again, they might just talk like hippies. The Conversation, May 2025. doi: 10.64628/AA.cecjagfn5. URL https://theconversation.com/ai-models-might-be-drawn-to-spiritual-bliss-then-again-the...

  2. [2]

    arXiv:2309.16349 [cs]

    URL http://arxiv.org/abs/2309.16349. arXiv:2309.16349 [cs]. Z. Hu, L. Song, J. Zhang, Z. Xiao, T. Wang, Z. Chen, N. J. Yuan, J. Lian, K. Ding, and H. Xiong. Explaining Length Bias in LLM-Based Preference Evaluations. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 20...

  3. [3]

    arXiv:2511.12573 [cs]

    URL http://arxiv.org/abs/2511.12573. arXiv:2511.12573 [cs]. H. R. Kirk, A. Whitefield, P. Röttger, A. Bean, K. Margatina, J. Ciro, R. Mosquera, M. Bartolo, A. Williams, H. He, B. Vidgen, and S. A. Hale. The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Ali...

  4. [4]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/be2e1b68b44f2419e19f6c35a1b8cf35-Abstract-Datasets_and_Benchmarks_Track.html. H. R. Kirk, H. Davidson, E. Saunders, L. Luettgau, B. Vidgen, S. A. Hale, and C. Summerfield. Neural steering vectors reveal dose and exposure-dependent impacts of human-AI relationships, Feb. 2026. URL http://arx...

  5. [5]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    ISBN 978-0-486-44136-8. Google-Books-ID: D74qAwAAQBAJ. D. McFadden. Conditional Logit Analysis of Qualitative Choice Behavior. Frontiers in Econometrics, 1974. URL https://escholarship.org/content/qt61s3q2xr/qt61s3q2xr.pdf. Publisher: Academic Press. L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Project...

  6. [6]

    URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html. N. Reimers and I. Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Aug. 2019. URL http://arxiv.org/abs/1908.10084. arXiv:1908.10084 [cs]. C. Richardson, Y. Zhang, K. Gillespie, S. Kar, A. Singh, Z. Raeesy, O....

  7.–42. [Non-bibliographic fragments extracted from the paper's supplementary material rather than cited works: PRISM vs PRISM-X participant country distributions, LLM-usage survey items, GPU and context-window error strings, willingness-to-pay comprehension checks, autograder trait definitions for model and user behaviour (sycophancy, relationship-seeking, specificity, opinionatedness, refusal, stereotyping, self-disclosure, naturalness, ecological validity, persona parroting), regression specifications for trait, length, and position-bias analyses, and persona-inference prompt excerpts.]