pith. machine review for the scientific record.

arxiv: 2510.13830 · v2 · submitted 2025-10-10 · 💻 cs.CL · cs.AI

Users as Annotators: LLM Preference Learning from Comparison Mode

Pith reviewed 2026-05-18 08:16 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM alignment · preference learning · user annotation · expectation-maximization · data filtering · comparison mode · latent quality factor

The pith

Generating responses from two different models lets an EM algorithm estimate each user's latent quality factor to filter preference data for LLM alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores collecting pairwise preference labels directly from everyday LLM users instead of relying only on professional annotators. Because users know their own queries best, these labels have natural value, yet they lack built-in quality checks. The method creates an asymmetry by producing the two responses in each pair from distinct models or model versions. This asymmetry is then used inside a user behavior model whose latent quality factor for each user is recovered by an expectation-maximization procedure. The recovered factors allow the authors to discard low-quality annotations before the data are used for alignment training.
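To make the setup concrete, here is a minimal simulation sketch of comparison-mode labels. It rests on an assumed functional form that this summary does not state: a user with quality factor `eta` labels carefully with probability `eta`, in which case they prefer the stronger model's response with probability `mu`, and otherwise clicks at random. The names `simulate_user`, `eta`, and `mu` are hypothetical and chosen only for illustration.

```python
import random

def simulate_user(eta, mu, n_pairs, rng):
    """Return how many of n_pairs labels favour the stronger model.

    ASSUMPTION (not taken from the paper): with probability eta the user
    reads both responses and prefers the stronger model's response with
    probability mu; otherwise the user clicks uniformly at random.
    """
    wins = 0
    for _ in range(n_pairs):
        attentive = rng.random() < eta
        p_strong = mu if attentive else 0.5
        wins += rng.random() < p_strong  # bool counts as 0 or 1
    return wins

rng = random.Random(0)
careful = simulate_user(eta=0.98, mu=0.8, n_pairs=2000, rng=rng)
careless = simulate_user(eta=0.0, mu=0.8, n_pairs=2000, rng=rng)
# Under the assumed model, the careful user's agreement rate should sit
# near 0.98 * 0.8 + 0.02 * 0.5, while the careless user's sits near 0.5.
print(careful / 2000, careless / 2000)
```

The gap between the two agreement rates is the asymmetry the method relies on: if both responses came from the same model, a careful user and a random clicker would both favour either side about half the time.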

Core claim

The central claim is that generating the two responses from different models, or from different versions of the same model, creates an asymmetry from which each user's latent quality factor can be reliably inferred through a proposed user behavior model. The model is estimated with an expectation-maximization algorithm, and the estimates are used to filter annotation data for improved LLM alignment.

What carries the argument

Latent quality factor inside a user behavior model, recovered by an expectation-maximization algorithm that exploits the asymmetry between the two response generators.
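As a rough illustration of this machinery, the sketch below fits a two-type mixture of binomials by EM and filters users by their posterior type probability. It is not the paper's Algorithm 1: it assumes only two user types, summarizes each user by a count of labels that favour the stronger model, and uses hypothetical names (`em_two_types`, `p_lo`, `p_hi`) and synthetic data throughout.

```python
import math
import random

def log_kernel(k, n, p):
    """Binomial log-likelihood up to the binomial coefficient,
    which cancels in the E-step ratio."""
    p = min(max(p, 1e-9), 1 - 1e-9)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def em_two_types(counts, n_iter=200, w=0.5, p_lo=0.5, p_hi=0.7):
    """Fit a two-component binomial mixture by EM (illustrative only).

    counts: list of (k_j, n_j); k_j of user j's n_j labels favour the
    stronger model. Returns (w, p_lo, p_hi, resp), where resp[j] is the
    posterior probability that user j is of the high-agreement type.
    """
    resp = []
    for _step in range(n_iter):
        # E-step: posterior type probabilities, computed in log space.
        resp = []
        for k, n in counts:
            a = math.log(w) + log_kernel(k, n, p_hi)
            b = math.log(1 - w) + log_kernel(k, n, p_lo)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append(ea / (ea + eb))
        # M-step: closed-form weighted updates.
        w = min(max(sum(resp) / len(resp), 1e-6), 1 - 1e-6)
        hi_n = sum(r * n for r, (k, n) in zip(resp, counts))
        lo_n = sum((1 - r) * n for r, (k, n) in zip(resp, counts))
        p_hi = sum(r * k for r, (k, n) in zip(resp, counts)) / max(hi_n, 1e-12)
        p_lo = sum((1 - r) * k for r, (k, n) in zip(resp, counts)) / max(lo_n, 1e-12)
    return w, p_lo, p_hi, resp

# Synthetic data (assumed): 80 careful users agree with the stronger model
# 80% of the time, 120 careless users agree 50% of the time, 50 labels each.
rng = random.Random(1)
is_careful = [j < 80 for j in range(200)]
counts = []
for flag in is_careful:
    p = 0.8 if flag else 0.5
    counts.append((sum(rng.random() < p for _ in range(50)), 50))

w, p_lo, p_hi, resp = em_two_types(counts)
kept = [j for j, r in enumerate(resp) if r > 0.5]  # filtering step
print(round(w, 2), round(p_lo, 2), round(p_hi, 2), len(kept))
```

The 0.5 posterior cut-off is an arbitrary choice for this sketch; any threshold on the inferred quality would play the same filtering role.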

If this is right

  • Filtered user annotations improve performance on downstream LLM alignment tasks.
  • The approach captures systematic differences in how individual users label preferences.
  • Ordinary user labels become a viable supplement to professionally annotated preference data.
  • Quality control for large-scale preference collection no longer requires additional human review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same asymmetry idea could be tested with responses generated under different prompts or decoding strategies rather than different models.
  • Real-time deployment might allow an LLM to maintain a running quality estimate for each recurring user and down-weight noisy feedback on the fly.
  • If quality factors correlate with observable user traits such as query complexity or session length, the model could be extended to predict reliability without waiting for many labels.
  • The method offers a template for quality-aware aggregation in other crowdsourced judgment settings where the items being judged have natural variation in difficulty.
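The running quality estimate mentioned in the second bullet could, in one possible form, be a per-user Bayesian update. The sketch below assumes a two-type user population with fixed, already-known agreement rates; the class `RunningQuality` and all default values are hypothetical and do not come from the paper.

```python
import math

class RunningQuality:
    """Running posterior that a user is of the careful type.

    ASSUMPTIONS (illustrative): two user types with known probabilities of
    preferring the stronger model's response, and a known prior share of
    careful users.
    """

    def __init__(self, prior=0.4, p_careful=0.8, p_careless=0.5):
        self.log_odds = math.log(prior / (1 - prior))
        self.p_careful = p_careful
        self.p_careless = p_careless

    def update(self, picked_stronger):
        """Fold one new label into the posterior and return the new weight."""
        if picked_stronger:
            self.log_odds += math.log(self.p_careful / self.p_careless)
        else:
            self.log_odds += math.log((1 - self.p_careful) / (1 - self.p_careless))
        return self.weight()

    def weight(self):
        """Posterior probability of the careful type (stable sigmoid)."""
        x = self.log_odds
        if x >= 0:
            return 1 / (1 + math.exp(-x))
        e = math.exp(x)
        return e / (1 + e)

user = RunningQuality()
for picked_stronger in [True] * 8 + [False] * 2:
    user.update(picked_stronger)
print(user.weight())  # could serve as a down-weighting factor for this user's feedback
```

Working in log-odds keeps each update to a single addition, which is what makes an on-the-fly estimate cheap to maintain.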

Load-bearing premise

The proposed user behavior model and the asymmetry created by generating responses from two different models or versions are sufficient to allow reliable inference of each user's latent quality factor from the observed preference labels.

What would settle it

If the subset of user data retained after EM-based filtering produces equal or worse downstream alignment performance than the unfiltered user data, the inference procedure has not isolated reliable signals.

Figures

Figures reproduced from arXiv: 2510.13830 by Xiaocheng Li, Zhongze Cai.

Figure 1: ChatGPT’s comparison mode. Two responses are generated, and the user clicks the preferred…
Figure 2: The proposed pipeline estimates the quality of users’ preference labels in the comparison mode.
Figure 3: The iterative updates of Algorithm 1 under the model settings of Example 1 (left) and Example 2 (right). The true model for Example 1 is P_η = 0.6·δ_{0.4} + 0.4·δ_{0.98} and for Example 2 is P_η = Beta(3, 5). The x- and y-axes represent the estimated parameter values, with the red star indicating the true parameters. The algorithm is repeated for multiple trials, each starting from different initialization points (m…
Figure 4: The estimation of P_η under model misspecification. In the left panel, the “True Distribution” P_η is a mixture of two Beta distributions, while the “Estimation” is taken from the two-point discrete family. In the right panel, the “True Distribution” follows a logistic-normal distribution, while the “Estimation” is taken from the Beta family. In both cases, Algorithm 1 approximates P_η reasonably well: in the…
Figure 5: The histogram of reward scores for different models, evaluated on the…
Figure 6: DPO performance under different filtering strategies. The x-axis represents filtering strategies
Figure 7: Approximating various true distributions
Figure 8: Accuracy in recovering the attentive users. Data are generated under…
Original abstract

Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data -- user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an alternative to professional annotators for collecting pairwise preference data for LLM alignment: users provide binary preferences on responses generated from two different models (or versions) in comparison mode. It introduces a user behavior model with a latent scalar quality factor q_u per user, estimated via an expectation-maximization algorithm that exploits the induced asymmetry, then filters annotations by this factor. The central claim is that this captures user behavior and yields higher-quality data for downstream LLM alignment.

Significance. If the latent quality factor proves identifiable and the filtering produces measurable gains, the work could meaningfully scale preference data collection by leveraging everyday user interactions rather than costly professional annotation. The asymmetry-based inference and EM procedure are technically interesting contributions to user modeling in alignment pipelines.

major comments (2)
  1. [§4] §4 (EM procedure for latent quality): The generative model assumes that the per-user quality factor q_u is identifiable from binary preferences given the model-pair asymmetry. No derivation, likelihood analysis, or simulation is provided to rule out invariance under trade-offs between q_u and the unknown strength gap between the two models. Without this, recovered q_u values may be non-unique or degenerate, directly undermining the filtering step that is load-bearing for the central claim.
  2. [Experiments] Experiments section: The abstract asserts downstream effectiveness for both user-behavior capture and alignment, yet the provided description contains no quantitative metrics, baselines, ablation on the filtering threshold, or error analysis. Central claims therefore rest on unshown results, preventing assessment of whether gains exceed what could be obtained by simpler heuristics.
minor comments (2)
  1. [§3] Notation for the user behavior model (e.g., the exact functional form linking q_u to preference probability) should be stated explicitly with all parameters listed to allow reproduction.
  2. [Figures] Figure captions for any EM convergence or quality-distribution plots could clarify axis labels and the precise filtering criterion applied.
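
The trade-off raised in the first major comment can be illustrated under one assumed functional form; the report does not state the paper's actual one, so the expression below is only a sketch. Suppose user u answers attentively with probability q_u, in which case they prefer the stronger model's response with probability μ, and otherwise picks uniformly at random:

```latex
\Pr(\text{user } u \text{ prefers the stronger model})
  = q_u\,\mu + (1 - q_u)\cdot\tfrac{1}{2}
  = \tfrac{1}{2} + q_u\left(\mu - \tfrac{1}{2}\right).
```

Under this assumption the labels depend on q_u and μ only through the product q_u(μ − 1/2). For any c > 0 that keeps both quantities in range, replacing q_u with c·q_u for every user and μ − 1/2 with (μ − 1/2)/c leaves the likelihood unchanged, so the absolute scale of q_u is not pinned down by the labels alone. Fixing μ, or constraining it with a prior, removes that freedom, and the relative ranking of users is unaffected by the rescaling.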

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our latent user quality model and the need for clearer empirical validation. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (EM procedure for latent quality): The generative model assumes that the per-user quality factor q_u is identifiable from binary preferences given the model-pair asymmetry. No derivation, likelihood analysis, or simulation is provided to rule out invariance under trade-offs between q_u and the unknown strength gap between the two models. Without this, recovered q_u values may be non-unique or degenerate, directly undermining the filtering step that is load-bearing for the central claim.

    Authors: We agree that the current version does not include a formal identifiability analysis. The model exploits the fixed asymmetry between the two response-generating models (e.g., different versions or architectures) to break the symmetry that would otherwise exist in standard pairwise data. In the revision we will add a dedicated subsection deriving the likelihood and showing that q_u is identifiable up to a monotonic transformation once the model-pair strength gap is treated as a fixed but unknown parameter; we will also include a short simulation study confirming that the EM procedure recovers relative user rankings reliably across different gap sizes. This directly supports the filtering step. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts downstream effectiveness for both user-behavior capture and alignment, yet the provided description contains no quantitative metrics, baselines, ablation on the filtering threshold, or error analysis. Central claims therefore rest on unshown results, preventing assessment of whether gains exceed what could be obtained by simpler heuristics.

    Authors: The referee is correct that the version under review presents the experimental claims without sufficient quantitative detail. The full manuscript does contain results on user-quality inference accuracy and downstream win-rate improvements after filtering, but these were not adequately summarized in the sections visible to the referee. In the revision we will expand the Experiments section to report concrete metrics (e.g., AUC for user-quality prediction, alignment win rates), include baselines such as random filtering and simple majority-vote heuristics, add an ablation on the filtering threshold, and provide error analysis. These additions will allow direct comparison against simpler alternatives. revision: yes
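
The kind of comparison promised in this response can be sketched on synthetic data. The example below computes a rank-based AUC for a per-user quality score and compares score-based filtering with a random-filtering baseline that keeps the same number of users. The score used here is simply each user's agreement rate with the stronger model, and the data, names (`auc`, `precision_of_kept`), and rates are assumptions for illustration rather than the paper's experimental setup.

```python
import random

def auc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count half); equivalent to the Mann-Whitney statistic."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_of_kept(kept, labels):
    """Share of retained users who are truly careful."""
    return sum(labels[j] for j in kept) / len(kept)

# Synthetic ground truth (assumed): 80 careful users, 120 careless users,
# 50 labels each, with agreement rates 0.8 and 0.5 respectively.
rng = random.Random(2)
labels = [j < 80 for j in range(200)]
scores = [
    sum(rng.random() < (0.8 if y else 0.5) for _ in range(50)) / 50
    for y in labels
]

top = sorted(range(200), key=lambda j: scores[j], reverse=True)[:80]
rand = rng.sample(range(200), 80)  # random-filtering baseline
print(auc(scores, labels))
print(precision_of_kept(top, labels), precision_of_kept(rand, labels))
```

A filtering method is only interesting if it clears the random baseline by a clear margin, and ideally also a simple heuristic such as the raw agreement rate used as the score here.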

Circularity Check

0 steps flagged

No significant circularity detected

Rationale

The paper introduces a generative user behavior model that exploits response asymmetry from two distinct models/versions to infer a per-user latent quality factor q_u via a standard EM procedure, then applies the resulting estimates to filter annotations before a separate downstream LLM alignment task. This chain is self-contained: the identifiability claim rests on the explicit modeling assumption of model-pair strength differences rather than on any re-use of the target result, and downstream performance is evaluated on an external alignment objective rather than on a quantity that is definitionally identical to the fitted q_u values. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on a newly introduced user behavior model whose parameters are estimated from the data itself; no external benchmarks or shipped code are mentioned.

free parameters (1)
  • latent user quality factor
    Estimated per user via EM to capture annotation reliability; value is data-dependent.
axioms (1)
  • domain assumption: Generating responses from two different models creates observable asymmetry that reveals user annotation quality.
    Invoked to justify the inference step in the user behavior model.
invented entities (1)
  • latent user quality factor (no independent evidence)
    purpose: To represent and estimate each user's annotation reliability for filtering.
    Newly postulated statistical construct with no independent evidence outside the model fit.

pith-pipeline@v0.9.0 · 5742 in / 1193 out tokens · 17630 ms · 2026-05-18T08:16:54.654579+00:00 · methodology


