Users as Annotators: LLM Preference Learning from Comparison Mode
Pith reviewed 2026-05-18 08:16 UTC · model grok-4.3
The pith
Generating responses from two different models lets an EM algorithm estimate each user's latent quality factor to filter preference data for LLM alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the asymmetry created by generating responses from two different models or versions of the same model allows reliable inference of each user's latent quality factor through a proposed user behavior model, which is estimated using an expectation-maximization algorithm to filter annotation data for improved LLM alignment.
What carries the argument
Latent quality factor inside a user behavior model, recovered by an expectation-maximization algorithm that exploits the asymmetry between the two response generators.
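To make this mechanism concrete, here is a minimal sketch of EM on a deliberately simplified user model. It is not the paper's model: it assumes each user is either attentive (prefers the stronger model's response with a shared probability `mu`) or inattentive (chooses at random), and every name in it (`em_user_quality`, `simulate`, `mu`, `pi`) is hypothetical.

```python
import math
import random


def log_lik(k, n, p):
    """Binomial log-likelihood of k wins in n labels (coefficient omitted; it cancels)."""
    p = min(max(p, 1e-9), 1 - 1e-9)
    return k * math.log(p) + (n - k) * math.log(1 - p)


def em_user_quality(wins, totals, mu=0.7, pi=0.5, iters=200):
    """Fit an assumed two-component mixture over users.

    wins[j]   = times user j preferred the stronger model's response
    totals[j] = number of labels from user j
    Returns (mu, pi, post), where post[j] = P(user j is attentive | data).
    """
    post = [0.5] * len(wins)
    for _ in range(iters):
        # E-step: posterior probability that each user is attentive.
        post = []
        for k, n in zip(wins, totals):
            la = math.log(pi) + log_lik(k, n, mu)        # attentive component
            lb = math.log(1 - pi) + log_lik(k, n, 0.5)   # random-guess component
            m = max(la, lb)
            ea, eb = math.exp(la - m), math.exp(lb - m)
            post.append(ea / (ea + eb))
        # M-step: posterior-weighted estimates of mu and pi.
        weight = sum(r * n for r, n in zip(post, totals))
        mu = sum(r * k for r, k in zip(post, wins)) / max(weight, 1e-12)
        pi = min(max(sum(post) / len(post), 1e-6), 1 - 1e-6)
    return mu, pi, post


def simulate(num_users=200, labels_per_user=50, mu=0.8, pi=0.5, seed=0):
    """Draw synthetic labels from the same assumed model."""
    rng = random.Random(seed)
    attentive = [rng.random() < pi for _ in range(num_users)]
    wins = [
        sum(rng.random() < (mu if a else 0.5) for _ in range(labels_per_user))
        for a in attentive
    ]
    return wins, [labels_per_user] * num_users, attentive
```

In this sketch the asymmetry is what does the work: if `mu` were 0.5, both components would predict the same labels and the posterior could not separate the two groups of users.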
If this is right
- Filtered user annotations improve performance on downstream LLM alignment tasks.
- The approach captures systematic differences in how individual users label preferences.
- Ordinary user labels become a viable supplement to professionally annotated preference data.
- Quality control for large-scale preference collection no longer requires additional human review.
Where Pith is reading between the lines
- The same asymmetry idea could be tested with responses generated under different prompts or decoding strategies rather than different models.
- Real-time deployment might allow an LLM to maintain a running quality estimate for each recurring user and down-weight noisy feedback on the fly.
- If quality factors correlate with observable user traits such as query complexity or session length, the model could be extended to predict reliability without waiting for many labels.
- The method offers a template for quality-aware aggregation in other crowdsourced judgment settings where the items being judged have natural variation in difficulty.
Load-bearing premise
The proposed user behavior model and the asymmetry created by generating responses from two different models or versions are sufficient to allow reliable inference of each user's latent quality factor from the observed preference labels.
What would settle it
If the subset of user data retained after EM-based filtering produces equal or worse downstream alignment performance than the unfiltered user data, the inference procedure has not isolated reliable signals.
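One way such a check could be run is a paired bootstrap over a shared set of evaluation prompts. The sketch below assumes per-prompt 0/1 outcomes (for example, whether a model aligned on each dataset wins against a fixed reference); that evaluation protocol and the function name `bootstrap_diff_ci` are assumptions, not something described in the paper.

```python
import random


def bootstrap_diff_ci(filtered_wins, unfiltered_wins, n_boot=2000, alpha=0.05, seed=0):
    """Paired bootstrap interval for mean(filtered) - mean(unfiltered).

    Both inputs are per-prompt 0/1 outcomes on the same prompts, in the same order.
    """
    if len(filtered_wins) != len(unfiltered_wins):
        raise ValueError("inputs must cover the same prompts")
    rng = random.Random(seed)
    n = len(filtered_wins)
    diffs = [f - u for f, u in zip(filtered_wins, unfiltered_wins)]
    means = []
    for _ in range(n_boot):
        # Resample prompts with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the resulting interval sits at or below zero, the filtered data has not shown a gain over the unfiltered data under this protocol.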
Original abstract
Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data -- user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an alternative to professional annotators for collecting pairwise preference data for LLM alignment: users provide binary preferences on responses generated from two different models (or versions) in comparison mode. It introduces a user behavior model with a latent scalar quality factor q_u per user, estimated via an expectation-maximization algorithm that exploits the induced asymmetry, then filters annotations by this factor. The central claim is that this captures user behavior and yields higher-quality data for downstream LLM alignment.
Significance. If the latent quality factor proves identifiable and the filtering produces measurable gains, the work could meaningfully scale preference data collection by leveraging everyday user interactions rather than costly professional annotation. The asymmetry-based inference and EM procedure are technically interesting contributions to user modeling in alignment pipelines.
major comments (2)
- [§4] §4 (EM procedure for latent quality): The generative model assumes that the per-user quality factor q_u is identifiable from binary preferences given the model-pair asymmetry. No derivation, likelihood analysis, or simulation is provided to rule out invariance under trade-offs between q_u and the unknown strength gap between the two models. Without this, recovered q_u values may be non-unique or degenerate, directly undermining the filtering step that is load-bearing for the central claim.
- [Experiments] Experiments section: The abstract asserts downstream effectiveness for both user-behavior capture and alignment, yet the provided description contains no quantitative metrics, baselines, ablation on the filtering threshold, or error analysis. Central claims therefore rest on unshown results, preventing assessment of whether gains exceed what could be obtained by simpler heuristics.
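The identifiability concern in the first major comment can be illustrated with a small numerical sketch. It assumes a functional form that is not taken from the paper: with probability `q` the user answers attentively and prefers the stronger model with probability `0.5 + gap`, otherwise the user guesses. The names `pref_prob`, `q`, and `gap` are hypothetical.

```python
def pref_prob(q, gap):
    """Assumed form: P(user prefers the stronger model) for quality q and strength gap."""
    return q * (0.5 + gap) + (1 - q) * 0.5  # algebraically 0.5 + q * gap


# Two different (quality, gap) pairs that imply the same observable behaviour.
p_low_quality_big_gap = pref_prob(q=0.5, gap=0.4)
p_high_quality_small_gap = pref_prob(q=1.0, gap=0.2)

# Halving every user's quality while doubling the gap leaves all label
# probabilities unchanged, so the data cannot tell the two settings apart.
qs = [0.2, 0.5, 0.8]
probs_a = [pref_prob(q, 0.2) for q in qs]
probs_b = [pref_prob(q / 2, 0.4) for q in qs]
```

Under this assumed form only the product `q * gap` is pinned down by the labels, so the ordering of users by quality survives but the absolute values do not unless the gap is fixed or constrained.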
minor comments (2)
- [§3] Notation for the user behavior model (e.g., the exact functional form linking q_u to preference probability) should be stated explicitly with all parameters listed to allow reproduction.
- [Figures] Figure captions for any EM convergence or quality-distribution plots could clarify axis labels and the precise filtering criterion applied.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of our latent user quality model and the need for clearer empirical validation. We address each major comment below and will incorporate revisions to strengthen the manuscript.
- Referee: [§4] §4 (EM procedure for latent quality): The generative model assumes that the per-user quality factor q_u is identifiable from binary preferences given the model-pair asymmetry. No derivation, likelihood analysis, or simulation is provided to rule out invariance under trade-offs between q_u and the unknown strength gap between the two models. Without this, recovered q_u values may be non-unique or degenerate, directly undermining the filtering step that is load-bearing for the central claim.
Authors: We agree that the current version does not include a formal identifiability analysis. The model exploits the fixed asymmetry between the two response-generating models (e.g., different versions or architectures) to break the symmetry that would otherwise exist in standard pairwise data. In the revision we will add a dedicated subsection deriving the likelihood and showing that q_u is identifiable up to a monotonic transformation once the model-pair strength gap is treated as a fixed but unknown parameter; we will also include a short simulation study confirming that the EM procedure recovers relative user rankings reliably across different gap sizes. This directly supports the filtering step.
Revision: yes
- Referee: [Experiments] Experiments section: The abstract asserts downstream effectiveness for both user-behavior capture and alignment, yet the provided description contains no quantitative metrics, baselines, ablation on the filtering threshold, or error analysis. Central claims therefore rest on unshown results, preventing assessment of whether gains exceed what could be obtained by simpler heuristics.
Authors: The referee is correct that the version under review presents the experimental claims without sufficient quantitative detail. The full manuscript does contain results on user-quality inference accuracy and downstream win-rate improvements after filtering, but these were not adequately summarized in the sections visible to the referee. In the revision we will expand the Experiments section to report concrete metrics (e.g., AUC for user-quality prediction, alignment win rates), include baselines such as random filtering and simple majority-vote heuristics, add an ablation on the filtering threshold, and provide error analysis. These additions will allow direct comparison against simpler alternatives.
Revision: yes
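Two of the items named in this response, AUC for user-quality prediction and a random-filtering baseline, can be sketched in a few lines. This is an illustration under assumptions rather than the authors' evaluation code: the function names are hypothetical, and the AUC presumes ground-truth attentiveness labels, which would typically exist only in simulation or on a gold-annotated subset.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def random_filter(user_ids, keep_fraction, seed=0):
    """Baseline: keep a random subset of users, matched in size to the filtered set."""
    rng = random.Random(seed)
    ids = list(user_ids)
    return set(rng.sample(ids, round(keep_fraction * len(ids))))
```

Matching the size of the random subset to the size of the quality-filtered subset keeps the comparison about which users are kept, not how many.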
Circularity Check
No significant circularity detected
The paper introduces a generative user behavior model that exploits response asymmetry from two distinct models/versions to infer a per-user latent quality factor q_u via a standard EM procedure, then applies the resulting estimates to filter annotations before a separate downstream LLM alignment task. This chain is self-contained: the identifiability claim rests on the explicit modeling assumption of model-pair strength differences rather than on any re-use of the target result, and downstream performance is evaluated on an external alignment objective rather than on a quantity that is definitionally identical to the fitted q_u values. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation.
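The filtering step in this chain can be sketched as follows, assuming per-user quality estimates are already available. The record schema, the `threshold` value, and the function name `filter_annotations` are hypothetical; the paper's actual filtering criterion is not specified in the material above.

```python
def filter_annotations(records, user_quality, threshold=0.5):
    """Keep preference records whose user has estimated quality >= threshold.

    records:      iterable of dicts with a 'user' key (hypothetical schema)
    user_quality: dict mapping user id -> estimated quality in [0, 1]
    Users with no estimate are treated as quality 0.0 and dropped.
    """
    return [r for r in records if user_quality.get(r["user"], 0.0) >= threshold]


# Example: three users, one of them without a quality estimate.
records = [
    {"user": "u1", "prompt": "p1", "preferred": "A"},
    {"user": "u2", "prompt": "p2", "preferred": "B"},
    {"user": "u3", "prompt": "p3", "preferred": "A"},
]
kept = filter_annotations(records, {"u1": 0.9, "u2": 0.2}, threshold=0.5)
```

Because the downstream alignment run only sees `kept`, its performance is measured on a separate objective and not on the fitted quality values themselves.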
Axiom & Free-Parameter Ledger
free parameters (1)
- latent user quality factor
axioms (1)
- domain assumption: Generating responses from two different models creates observable asymmetry that reveals user annotation quality.
invented entities (1)
- latent user quality factor (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation, washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure, absolute_floor_iff_bare_distinguishability (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.