
arxiv: 2604.17289 · v1 · submitted 2026-04-19 · 💻 cs.LG

REALM: Reliable Expertise-Aware Language Model Fine-Tuning from Noisy Annotations

Pith reviewed 2026-05-10 06:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords noisy annotations · annotator expertise · language model fine-tuning · unsupervised learning · crowdsourced data · question answering · mixture model · multi-task learning

The pith

REALM jointly learns model parameters and per-annotator expertise scalars, with no supervision beyond annotator identity, by modeling each label as a mixture of the current prediction and uniform noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that fine-tuning language models on crowd-sourced labels can avoid absorbing errors from unreliable annotators by learning a scalar expertise weight for each worker during training itself. A sympathetic reader would care because standard practice either discards annotator identity through majority vote or trains directly on noisy data, both of which embed mistakes into the model parameters. The method requires no extra supervision beyond knowing which annotator produced each label and extends to multiple tasks by learning an expertise matrix. Experiments on question-answering benchmarks with simulated noise show the approach beats naive supervised fine-tuning in most settings, with larger gains when noise is high or models are bigger.

Core claim

REALM jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to 50% in the most adversarial regime.
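In symbols, the mixture described above might be written as follows. This is a sketch, not the paper's own notation: the symbols $e_a$ (expertise of annotator $a$), $\pi_\theta$ (the model's predictive distribution), $K$ (number of answer options), and $E_{a,t}$ (entry of the expertise matrix for annotator $a$ on task $t$) are assumptions introduced here for illustration, as is the maximum-likelihood form of the joint objective.

```latex
% Single-task mixture: observed label y from annotator a on input x
p(y \mid x, a) = e_a \, \pi_\theta(y \mid x) + (1 - e_a) \, \frac{1}{K},
\qquad e_a \in [0, 1]

% Multi-task extension: one expertise entry per (annotator, task) pair
p(y \mid x, a, t) = E_{a,t} \, \pi_\theta(y \mid x) + (1 - E_{a,t}) \, \frac{1}{K_t}

% Joint objective over model parameters and expertise (assumed form)
\max_{\theta, \, e} \; \sum_{i} \log p(y_i \mid x_i, a_i)
```

With $e_a = 1$ the first expression reduces to ordinary supervised fine-tuning on that annotator's labels; with $e_a = 0$ the label carries no gradient signal for $\theta$.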

What carries the argument

A per-annotator scalar that weights the current model prediction against a uniform random guess to explain each observed label during joint optimization.
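A minimal sketch of how such a per-annotator weight could enter a training loss, assuming a classification-style setup with K answer options. The function name `mixture_nll`, the array layout, and the use of NumPy are illustrative choices, not taken from the paper.

```python
import numpy as np

def mixture_nll(model_probs, labels, annotator_ids, expertise):
    """Mean negative log-likelihood of observed labels under the mixture.

    model_probs:   (N, K) model distribution over K answer options
    labels:        (N,)   observed (possibly noisy) label indices
    annotator_ids: (N,)   index of the annotator who produced each label
    expertise:     (A,)   per-annotator scalars in [0, 1] (hypothetical layout)
    """
    n, k = model_probs.shape
    p_model = model_probs[np.arange(n), labels]  # model prob. of the observed label
    e = expertise[annotator_ids]                 # expertise of each label's annotator
    p_mix = e * p_model + (1.0 - e) / k          # mix with a uniform random guess
    return float(-np.mean(np.log(p_mix)))

probs = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
labels = np.array([0, 2])       # the second label disagrees with the model
annotators = np.array([0, 1])
trusting = mixture_nll(probs, labels, annotators, np.array([1.0, 1.0]))
hedged = mixture_nll(probs, labels, annotators, np.array([1.0, 0.0]))
```

In this toy case, lowering the second annotator's expertise reduces the loss, because the disagreeing label is explained as a random guess rather than forcing the model toward it.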

Load-bearing premise

Annotator behavior can be captured by a single scalar expertise weight mixing the model's current prediction with a uniform random guess, and this mixture remains a good model of the data-generating process throughout training.

What would settle it

Apply the method to a dataset where annotators exhibit consistent, non-uniform error patterns on specific question subtypes, patterns that a single overall expertise scalar cannot explain, and check whether the accuracy gains over naive fine-tuning disappear.

Figures

Figures reproduced from arXiv: 2604.17289 by Mahnoosh Alizadeh, Mark Beliaev, Ramtin Pedarsani, Sajjad Ghiasvand.

Figure 1: Test accuracy (%) over training steps under Dist.
Original abstract

Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to $50\%$ in the most adversarial regime and gains that grow with model capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes REALM, which jointly optimizes language model parameters and per-annotator expertise scalars (or an expertise matrix in the multi-task extension) by modeling each observed label as an expertise-weighted mixture of the model's current prediction and a uniform random distribution. It evaluates the approach on five question-answering benchmarks by fine-tuning Flan-T5 models of three sizes under simulated noisy annotations, reporting consistent outperformance over naive noisy supervised fine-tuning, with gains up to 50% in high-noise regimes that increase with model capacity.

Significance. If the gains hold under realistic annotator noise, REALM would provide a practical unsupervised way to leverage heterogeneous crowd annotations without discarding annotator identity or requiring gold-standard expertise labels. The joint optimization and multi-task matrix extension are conceptually clean, and the reported scaling of gains with model size is a positive signal worth further investigation.

major comments (2)
  1. [§5] §5 (Experiments): All reported results use noise generated exactly from the mixture p(label) = expertise * model_prediction + (1-expertise) * uniform that REALM assumes during training. This matched generative process allows expertise recovery by construction and does not test the method under realistic misspecification such as class-conditional confusions or position biases typical of crowd annotations.
  2. [§4.1] §4.1 (Method): The single scalar expertise per annotator is assumed constant across examples and tasks (except in the matrix extension); no analysis shows whether the learned scalars remain stable or become entangled with model errors as training progresses, which is load-bearing for the claim that expertise can be recovered unsupervised throughout fine-tuning.
minor comments (2)
  1. The abstract and results tables should report the number of random seeds, standard deviations, and any statistical significance tests for the accuracy improvements.
  2. [§5] Clarify in §5 which specific dataset, noise level, and model size produce the 'up to 50%' improvement cited in the abstract.
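The distinction raised in the first major comment can be illustrated with a small simulation. The sketch below contrasts noise drawn from the uniform mixture with noise drawn from a class-conditional confusion matrix. It is an assumption-laden toy: reference labels stand in for clean predictions, and the function names, the confusion matrix, and the expertise value 0.6 are all hypothetical.

```python
import numpy as np

def sample_mixture_noise(true_labels, expertise, num_classes, rng):
    """With probability `expertise` keep the reference label, else guess uniformly."""
    n = true_labels.shape[0]
    keep = rng.random(n) < expertise
    random_labels = rng.integers(0, num_classes, size=n)
    return np.where(keep, true_labels, random_labels)

def sample_confusion_noise(true_labels, confusion, rng):
    """Draw each label from the matching row of a class-conditional confusion matrix."""
    return np.array([rng.choice(confusion.shape[1], p=confusion[y]) for y in true_labels])

def error_targets(truth, noisy, source_class, num_classes):
    """Distribution of wrong labels assigned to examples of `source_class`."""
    wrong = noisy[(truth == source_class) & (noisy != truth)]
    return np.bincount(wrong, minlength=num_classes) / max(len(wrong), 1)

rng = np.random.default_rng(0)
truth = rng.integers(0, 3, size=20000)
# Hypothetical annotator who systematically mistakes class 0 for class 1.
confusion = np.array([[0.6, 0.4, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
uniform_noisy = sample_mixture_noise(truth, 0.6, 3, rng)
confused_noisy = sample_confusion_noise(truth, confusion, rng)

uniform_errors = error_targets(truth, uniform_noisy, 0, 3)
confused_errors = error_targets(truth, confused_noisy, 0, 3)
```

Under the mixture, errors on class 0 spread roughly evenly over the other classes; under the confusion matrix they all land on class 1, a structure that a single scalar per annotator cannot represent.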

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the work.

Point-by-point responses
  1. Referee: [§5] §5 (Experiments): All reported results use noise generated exactly from the mixture p(label) = expertise * model_prediction + (1-expertise) * uniform that REALM assumes during training. This matched generative process allows expertise recovery by construction and does not test the method under realistic misspecification such as class-conditional confusions or position biases typical of crowd annotations.

    Authors: We agree that the current experiments rely on noise generated from the exact mixture process assumed by REALM, providing a controlled test of the core mechanism but leaving open questions about robustness to misspecification. In the revised manuscript we will add experiments that simulate class-conditional confusion matrices and position biases drawn from realistic crowd-annotation patterns. These new results will be reported alongside the existing matched-noise results to quantify performance under more varied noise structures. revision: yes

  2. Referee: [§4.1] §4.1 (Method): The single scalar expertise per annotator is assumed constant across examples and tasks (except in the matrix extension); no analysis shows whether the learned scalars remain stable or become entangled with model errors as training progresses, which is load-bearing for the claim that expertise can be recovered unsupervised throughout fine-tuning.

    Authors: The constant scalar per annotator is a modeling choice that keeps the approach fully unsupervised and scalable. While the manuscript does not currently contain explicit tracking of scalar trajectories, the reported accuracy gains that increase with model capacity offer supporting evidence that expertise recovery remains effective. We will add, in the revision, plots of learned expertise scalars over training epochs on multiple datasets together with a short analysis confirming convergence and separation from model-error dynamics. revision: yes
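The kind of trajectory tracking promised here can be sketched on toy data. The example below is not the paper's procedure: it holds the model fixed and updates only the expertise scalars with a standard EM iteration for mixture weights, a swapped-in technique chosen for brevity. The function name, the synthetic model probabilities, and the true expertise values 0.9 and 0.2 are assumptions.

```python
import numpy as np

def em_expertise_trajectory(p_observed, annotator_ids, num_annotators, num_classes, steps=50):
    """Track per-annotator mixture weights over EM iterations (model held fixed)."""
    expertise = np.full(num_annotators, 0.5)
    history = [expertise.copy()]
    for _ in range(steps):
        e = expertise[annotator_ids]
        from_model = e * p_observed
        # E-step: probability that each label came from the model component.
        resp = from_model / (from_model + (1.0 - e) / num_classes)
        # M-step: each annotator's weight is the mean responsibility of their labels.
        for a in range(num_annotators):
            expertise[a] = resp[annotator_ids == a].mean()
        history.append(expertise.copy())
    return np.array(history)

rng = np.random.default_rng(0)
K, n_per = 3, 4000
true_e = np.array([0.9, 0.2])                 # hypothetical ground-truth expertise
annotator_ids = np.repeat([0, 1], n_per)
truth = rng.integers(0, K, size=2 * n_per)
model_probs = np.full((2 * n_per, K), 0.05)   # fixed, fairly confident model
model_probs[np.arange(2 * n_per), truth] = 0.9
e_col = true_e[annotator_ids, None]
label_dist = e_col * model_probs + (1.0 - e_col) / K   # matched mixture noise
labels = np.array([rng.choice(K, p=row) for row in label_dist])
p_observed = model_probs[np.arange(2 * n_per), labels]

history = em_expertise_trajectory(p_observed, annotator_ids, 2, K)
```

Each row of `history` is one iteration, so plotting its columns gives one curve per annotator. Because the model is frozen and the noise matches the mixture exactly, this setting cannot show the entanglement the referee is concerned about; it only shows what stable separation would look like.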

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on explicit modeling and external evaluation

full rationale

The paper introduces REALM via an explicit generative assumption (label as expertise-weighted mixture of current model prediction and uniform) and jointly optimizes model parameters plus per-annotator scalars (or matrix in multi-task). No equation or step equates the reported accuracy gains to the inputs by construction; the mixture is an ansatz, not a tautology, and the performance numbers are obtained by running the optimizer on held-out QA benchmarks under simulated noise. Because the central empirical claims are not forced by re-labeling fitted quantities as predictions and no self-citation chain is load-bearing for the uniqueness of the approach, the derivation remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the mixture model being a faithful representation of annotator behavior and on the optimization successfully recovering meaningful expertise values from noisy labels alone.

free parameters (1)
  • per-annotator expertise scalars
    Learned unsupervised from label data; central to the mixture weighting.
axioms (1)
  • domain assumption: Annotator reliability is constant across examples and can be summarized by a single scalar
    Invoked when defining the mixture weight for each label.

pith-pipeline@v0.9.0 · 5512 in / 1262 out tokens · 27836 ms · 2026-05-10T06:48:04.907329+00:00 · methodology


A Additional Results

A.1 Results under Dist. ② and Dist. ③

Tables 6 and 7 report results under the remaining two expertise distributions. Un...