REALM: Reliable Expertise-Aware Language Model Fine-Tuning from Noisy Annotations
Pith reviewed 2026-05-10 06:48 UTC · model grok-4.3
The pith
REALM jointly learns model parameters and per-annotator expertise scalars without supervision, modeling each label as a mixture of the current prediction and uniform noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
REALM jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to 50% in the most adversarial regime.
What carries the argument
A per-annotator scalar that weights the current model prediction against a uniform random guess to explain each observed label during joint optimization.
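The mixture described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `mixture_prob` and `mixture_nll` and the example probabilities are hypothetical, and the sketch assumes a fixed set of answer options with the model's prediction given as a probability list.

```python
import math

def mixture_prob(model_probs, label, expertise):
    """Probability of an observed label under the expertise-weighted mixture."""
    uniform = 1.0 / len(model_probs)  # uniform random guess over the options
    return expertise * model_probs[label] + (1.0 - expertise) * uniform

def mixture_nll(model_probs, label, expertise):
    """Negative log-likelihood of one annotated label (assumed training loss form)."""
    return -math.log(mixture_prob(model_probs, label, expertise))

# Hypothetical model prediction over 4 answer options.
probs = [0.7, 0.1, 0.1, 0.1]
agree = mixture_nll(probs, 0, expertise=0.9)
disagree_expert = mixture_nll(probs, 1, expertise=0.9)
disagree_novice = mixture_nll(probs, 1, expertise=0.1)
```

In this toy case a label that disagrees with the model is cheaper to explain with a low expertise value than with a high one, which is the pressure that would let expertise be inferred from annotator identity alone.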
Load-bearing premise
Annotator behavior can be captured by a single scalar expertise weight mixing the model's current prediction with a uniform random guess, and this mixture remains a good model of the data-generating process throughout training.
What would settle it
Apply the method to a dataset where annotators exhibit consistent, non-uniform error patterns on specific question subtypes that a single overall expertise scalar cannot explain, then check whether the accuracy gains over naive fine-tuning disappear.
Original abstract
Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to $50\%$ in the most adversarial regime and gains that grow with model capacity.
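For the multi-task extension, one plausible reading is that the scalar becomes a lookup into an annotator-by-task matrix. The sketch below assumes that reading. The sigmoid parameterization, the record layout, and the name `multitask_nll` are illustrative assumptions and are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multitask_nll(batch, expertise_logits):
    """Mean NLL over (annotator, task, model_probs, label) records.

    expertise_logits[a][t] is an unconstrained parameter; the sigmoid keeps
    the resulting expertise in (0, 1). This parameterization is an assumption.
    """
    total = 0.0
    for annotator, task, model_probs, label in batch:
        rho = sigmoid(expertise_logits[annotator][task])
        uniform = 1.0 / len(model_probs)
        total -= math.log(rho * model_probs[label] + (1.0 - rho) * uniform)
    return total / len(batch)

# Hypothetical setup: 2 annotators x 2 tasks, three annotated examples.
expertise_logits = [[2.0, -2.0], [0.0, 0.0]]
batch = [
    (0, 0, [0.8, 0.2], 0),
    (0, 1, [0.8, 0.2], 1),
    (1, 0, [0.6, 0.2, 0.2], 2),
]
loss = multitask_nll(batch, expertise_logits)
```

Tasks may have different numbers of answer options, so the uniform component is computed per record rather than once per batch.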
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes REALM, which jointly optimizes language model parameters and per-annotator expertise scalars (or an expertise matrix in the multi-task extension) by modeling each observed label as an expertise-weighted mixture of the model's current prediction and a uniform random distribution. It evaluates the approach on five question-answering benchmarks by fine-tuning Flan-T5 models of three sizes under simulated noisy annotations. It reports consistent outperformance over naive noisy supervised fine-tuning, with accuracy gains of up to 50% in the most adversarial noise regime and gains that grow with model capacity.
Significance. If the gains hold under realistic annotator noise, REALM would provide a practical unsupervised way to leverage heterogeneous crowd annotations without discarding annotator identity or requiring gold-standard expertise labels. The joint optimization and multi-task matrix extension are conceptually clean, and the reported scaling of gains with model size is a positive signal worth further investigation.
major comments (2)
- [§5] §5 (Experiments): All reported results use noise generated exactly from the mixture p(label) = expertise * model_prediction + (1-expertise) * uniform that REALM assumes during training. This matched generative process allows expertise recovery by construction and does not test the method under realistic misspecification such as class-conditional confusions or position biases typical of crowd annotations.
- [§4.1] §4.1 (Method): The single scalar expertise per annotator is assumed constant across examples and tasks (except in the matrix extension); no analysis shows whether the learned scalars remain stable or become entangled with model errors as training progresses, which is load-bearing for the claim that expertise can be recovered unsupervised throughout fine-tuning.
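The contrast between matched and misspecified noise in the first comment can be made concrete with three toy samplers. The matched sampler follows the mixture formula quoted in the report. The confusion-matrix and position-bias samplers are hypothetical forms chosen for illustration, since the report names these noise types without specifying them.

```python
import random

def sample_matched(model_probs, expertise, rng):
    """Matched noise: expertise * model_probs + (1 - expertise) * uniform."""
    k = len(model_probs)
    if rng.random() < expertise:
        return rng.choices(range(k), weights=model_probs)[0]
    return rng.randrange(k)

def sample_confusion(true_label, confusion, rng):
    """Class-conditional noise: row `true_label` gives the label probabilities."""
    row = confusion[true_label]
    return rng.choices(range(len(row)), weights=row)[0]

def sample_position_bias(true_label, accuracy, favored_position, rng):
    """Position bias: when wrong, the annotator picks one favored option slot."""
    if rng.random() < accuracy:
        return true_label
    return favored_position

rng = random.Random(0)
# Hypothetical annotator who systematically confuses classes 0 and 1.
confusion = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]
noisy = [sample_confusion(0, confusion, rng) for _ in range(5)]
```

Neither of the last two samplers can be written as a single scalar mixed with a uniform distribution, which is why they would exercise the misspecification the referee describes.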
minor comments (2)
- The abstract and results tables should report the number of random seeds, standard deviations, and any statistical significance tests for the accuracy improvements.
- [§5] Clarify in §5 which specific dataset, noise level, and model size produce the 'up to 50%' improvement cited in the abstract.
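The reporting requested in the first minor comment amounts to a short computation per results cell. A minimal sketch follows, assuming one accuracy per random seed with the method and the baseline run on the same seeds. The helper names and the numbers are placeholders for illustration only and are not results from the paper.

```python
import statistics

def summarize(accuracies):
    """Return (mean, sample standard deviation, number of seeds)."""
    return statistics.mean(accuracies), statistics.stdev(accuracies), len(accuracies)

def paired_gain(method_acc, baseline_acc):
    """Mean and sample std of per-seed accuracy differences (same seed order)."""
    diffs = [m - b for m, b in zip(method_acc, baseline_acc)]
    return statistics.mean(diffs), statistics.stdev(diffs)

# Placeholder values, NOT taken from the paper.
placeholder_method = [0.62, 0.60, 0.64]
placeholder_baseline = [0.50, 0.49, 0.51]

mean_acc, std_acc, n_seeds = summarize(placeholder_baseline)
gain_mean, gain_std = paired_gain(placeholder_method, placeholder_baseline)
```

Reporting the paired gain alongside the per-method means also makes explicit whether a headline figure is an absolute difference in accuracy or a relative one.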
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the work.
Point-by-point responses
Referee: [§5] §5 (Experiments): All reported results use noise generated exactly from the mixture p(label) = expertise * model_prediction + (1-expertise) * uniform that REALM assumes during training. This matched generative process allows expertise recovery by construction and does not test the method under realistic misspecification such as class-conditional confusions or position biases typical of crowd annotations.
Authors: We agree that the current experiments rely on noise generated from the exact mixture process assumed by REALM, providing a controlled test of the core mechanism but leaving open questions about robustness to misspecification. In the revised manuscript we will add experiments that simulate class-conditional confusion matrices and position biases drawn from realistic crowd-annotation patterns. These new results will be reported alongside the existing matched-noise results to quantify performance under more varied noise structures.
revision: yes
Referee: [§4.1] §4.1 (Method): The single scalar expertise per annotator is assumed constant across examples and tasks (except in the matrix extension); no analysis shows whether the learned scalars remain stable or become entangled with model errors as training progresses, which is load-bearing for the claim that expertise can be recovered unsupervised throughout fine-tuning.
Authors: The constant scalar per annotator is a modeling choice that keeps the approach fully unsupervised and scalable. While the manuscript does not currently contain explicit tracking of scalar trajectories, the reported accuracy gains that increase with model capacity offer supporting evidence that expertise recovery remains effective. We will add, in the revision, plots of learned expertise scalars over training epochs on multiple datasets together with a short analysis confirming convergence and separation from model-error dynamics.
revision: yes
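A scalar trajectory of the kind the authors promise can be illustrated on toy data. The sketch below is not the paper's joint gradient-based training: it holds the model predictions fixed and swaps in a plain EM update for the mixture weight, so it shows only what a convergence trace looks like and says nothing about entanglement with model errors. All names and the synthetic data are hypothetical.

```python
import random

def em_expertise_trajectory(label_probs, num_classes, init=0.5, steps=50):
    """Track an expertise estimate across EM steps with model predictions fixed.

    label_probs[i] is the model's probability for the label the annotator
    gave on example i.
    """
    uniform = 1.0 / num_classes
    rho = init
    trajectory = [rho]
    for _ in range(steps):
        # E-step: posterior that each label came from the model component.
        resp = [rho * p / (rho * p + (1.0 - rho) * uniform) for p in label_probs]
        # M-step: the new mixture weight is the mean responsibility.
        rho = sum(resp) / len(resp)
        trajectory.append(rho)
    return trajectory

# Toy data: sharp, fixed model predictions; labels drawn from the mixture.
rng = random.Random(0)
true_rho, k = 0.8, 4
label_probs = []
for _ in range(2000):
    probs = [0.03] * k
    probs[rng.randrange(k)] = 0.91
    if rng.random() < true_rho:
        label = rng.choices(range(k), weights=probs)[0]
    else:
        label = rng.randrange(k)
    label_probs.append(probs[label])

trajectory = em_expertise_trajectory(label_probs, k)
```

Plotting `trajectory` against the step index gives the kind of curve the rebuttal describes. In the real setting the model predictions would change at every step, so stability in this fixed-model toy is a necessary check rather than evidence for the full claim.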
Circularity Check
No circularity in derivation chain; claims rest on explicit modeling and external evaluation
full rationale
The paper introduces REALM via an explicit generative assumption (label as expertise-weighted mixture of current model prediction and uniform) and jointly optimizes model parameters plus per-annotator scalars (or matrix in multi-task). No equation or step equates the reported accuracy gains to the inputs by construction; the mixture is an ansatz, not a tautology, and the performance numbers are obtained by running the optimizer on held-out QA benchmarks under simulated noise. Because the central empirical claims are not forced by re-labeling fitted quantities as predictions and no self-citation chain is load-bearing for the uniqueness of the approach, the derivation remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-annotator expertise scalars
axioms (1)
- domain assumption: Annotator reliability is constant across examples and can be summarized by a single scalar