REALM: Reliable Expertise-Aware Language Model Fine-Tuning from Noisy Annotations

Mahnoosh Alizadeh; Mark Beliaev; Ramtin Pedarsani; Sajjad Ghiasvand

arXiv: 2604.17289 · v1 · submitted 2026-04-19 · 💻 cs.LG

Sajjad Ghiasvand, Mark Beliaev, Mahnoosh Alizadeh, Ramtin Pedarsani

Pith reviewed 2026-05-10 06:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords noisy annotations · annotator expertise · language model fine-tuning · unsupervised learning · crowdsourced data · question answering · mixture model · multi-task learning


The pith

REALM jointly learns model parameters and per-annotator expertise scalars without extra supervision, by modeling each label as a mixture of the model's current prediction and uniform noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that fine-tuning language models on crowd-sourced labels can avoid absorbing errors from unreliable annotators by learning a scalar expertise weight for each worker during training itself. A sympathetic reader would care because standard practice either discards annotator identity through majority vote or trains directly on noisy data, both of which embed mistakes into the model parameters. The method requires no extra supervision beyond knowing which annotator produced each label and extends to multiple tasks by learning an expertise matrix. Experiments on question-answering benchmarks with simulated noise show the approach beats naive supervised fine-tuning in most settings, with larger gains when noise is high or models are bigger.

Core claim

REALM jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to 50% in the most adversarial regime.
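A minimal sketch of the mixture described above, assuming a multiple-choice setting with K answer options and an expertise weight in [0, 1]. The function names `mixture_prob` and `mixture_nll`, and the choice to pass the weight directly rather than through some reparameterization, are illustrative assumptions and not the paper's implementation.

```python
import math

def mixture_prob(model_probs, label, expertise):
    """Probability of an observed label under the assumed mixture.

    model_probs: the model's predicted distribution over K options.
    label: index of the option the annotator chose.
    expertise: weight in [0, 1]; 1 follows the model's prediction, 0 is a uniform guess.
    """
    k = len(model_probs)
    return expertise * model_probs[label] + (1.0 - expertise) / k

def mixture_nll(model_probs, label, expertise):
    """Negative log-likelihood of one (label, annotator) pair."""
    return -math.log(mixture_prob(model_probs, label, expertise))

probs = [0.7, 0.1, 0.1, 0.1]
print(mixture_prob(probs, 0, 1.0))  # full expertise: equals the model probability
print(mixture_prob(probs, 2, 0.0))  # zero expertise: 1/K regardless of the model
```

Under this reading, a label from a low-expertise annotator has nearly the same likelihood whatever the model predicts, so it contributes little gradient to the model parameters.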

What carries the argument

A per-annotator scalar that weights the current model prediction against a uniform random guess to explain each observed label during joint optimization.

Load-bearing premise

Annotator behavior can be captured by a single scalar expertise weight mixing the model's current prediction with a uniform random guess, and this mixture remains a good model of the data-generating process throughout training.
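To make the premise concrete, here is a hedged toy sketch of how a single scalar could be fitted from labels alone. It holds the model's prediction fixed (the paper optimizes model and expertise jointly), parameterizes expertise as a sigmoid of an unconstrained value, and uses plain gradient ascent; all three choices, along with the name `fit_expertise`, are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_expertise(label_probs, num_options, steps=200, lr=0.5):
    """Fit one annotator's expertise by gradient ascent on the mixture log-likelihood.

    label_probs: model probability assigned to each label this annotator gave
                 (the model is held fixed here, unlike in joint training).
    """
    theta = 0.0  # unconstrained parameter; expertise = sigmoid(theta)
    uniform = 1.0 / num_options
    for _ in range(steps):
        w = sigmoid(theta)
        # d/dw of log(w * p + (1 - w) * uniform), averaged over the labels
        grad_w = sum((p - uniform) / (w * p + (1.0 - w) * uniform)
                     for p in label_probs) / len(label_probs)
        theta += lr * grad_w * w * (1.0 - w)  # chain rule through the sigmoid
    return sigmoid(theta)

model_probs = [0.85, 0.05, 0.05, 0.05]                 # fixed toy prediction
careful = [model_probs[0]] * 20                        # always picks option 0
random_like = [model_probs[i % 4] for i in range(20)]  # cycles through all options

w_careful = fit_expertise(careful, 4)
w_random = fit_expertise(random_like, 4)
print(round(w_careful, 3), round(w_random, 3))
```

The annotator who agrees with the model is pushed toward high expertise and the one who cycles through options toward low expertise. The estimate is only as good as the model's predictions, which is why the stability of this mixture during training matters.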

What would settle it

Apply the method to a dataset where annotators exhibit consistent, non-uniform error patterns on specific question subtypes that cannot be explained by a single overall expertise scalar and verify whether accuracy gains over naive fine-tuning disappear.
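One way such a test set could be simulated is with class-conditional confusion, sketched below. The confusion matrix, the annotator it describes, and the helper `sample_confused_label` are hypothetical and only illustrate the kind of structured error a single scalar cannot express.

```python
import random

def sample_confused_label(true_label, confusion, rng):
    """Draw a noisy label from one row of a class-conditional confusion matrix.

    confusion[true_label] is a list of probabilities over the observed label.
    """
    row = confusion[true_label]
    return rng.choices(range(len(row)), weights=row, k=1)[0]

# Hypothetical annotator: reliable on classes 0 and 2, but mistakes class 1
# for class 2 most of the time. Uniform noise with one mixing weight would
# spread errors evenly instead of concentrating them like this.
confusion = [
    [1.0, 0.0, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]

rng = random.Random(0)
labels_for_class_1 = [sample_confused_label(1, confusion, rng) for _ in range(1000)]
share_wrong = labels_for_class_1.count(2) / len(labels_for_class_1)
print(share_wrong)
```

Training on labels drawn this way, and comparing against naive fine-tuning on the same labels, would show whether the gains survive when the noise does not match the assumed mixture.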

Figures

Figures reproduced from arXiv: 2604.17289 by Mahnoosh Alizadeh, Mark Beliaev, Ramtin Pedarsani, Sajjad Ghiasvand.

Original abstract

Supervised fine-tuning of large language models relies on human-annotated data, yet annotation pipelines routinely involve multiple crowdworkers of heterogeneous expertise. Standard practice aggregates labels via majority vote or simple averaging, discarding annotator identity and causing the model to absorb the errors of unreliable annotators directly into its parameters. We propose REALM, a method that jointly learns the model parameters and a scalar expertise value for each annotator entirely unsupervised, requiring no supervision beyond annotator identity. The key idea is to model each observed label as a mixture between the model's prediction and a uniform random guess, weighted by the annotator's learned expertise. We extend REALM to a multi-task setting via a learned expertise matrix that captures per-annotator reliability across tasks. We evaluate on five question answering benchmarks, fine-tuning three sizes of Flan-T5 under simulated noisy annotations. The proposed algorithm consistently outperforms the naive noisy SFT in the large majority of single- and multi-task settings, across datasets, model sizes, and noise types, with accuracy improvements of up to $50\%$ in the most adversarial regime and gains that grow with model capacity.
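For contrast, the majority-vote aggregation that the abstract describes as standard practice can be sketched as follows. The triple layout and the identifiers in the toy data are assumptions for illustration; the point is that the annotator field is read and then ignored.

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate labels per example, ignoring who provided them.

    annotations: list of (example_id, annotator_id, label) triples.
    Ties go to whichever label was seen first for that example.
    """
    by_example = {}
    for example_id, _annotator_id, label in annotations:
        by_example.setdefault(example_id, []).append(label)
    return {ex: Counter(labels).most_common(1)[0][0]
            for ex, labels in by_example.items()}

annotations = [
    ("q1", "ann_a", "B"), ("q1", "ann_b", "C"), ("q1", "ann_c", "C"),
    ("q2", "ann_a", "A"), ("q2", "ann_b", "A"), ("q2", "ann_c", "D"),
]
print(majority_vote(annotations))
```

If `ann_a` happened to be the only reliable annotator, the aggregated label for `q1` would still follow the other two, which is the failure mode the abstract points to.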

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

REALM jointly learns model weights and per-annotator expertise scalars via a mixture model, but all reported gains are measured on noise simulated from that same mixture.


The core contribution is an unsupervised procedure that optimizes both the fine-tuned LLM and a scalar expertise weight per annotator. Each label is treated as a convex combination of the model's current prediction and a uniform distribution, with the weight learned from the data. The authors extend the same idea to multi-task settings by replacing the scalar with a matrix of per-annotator, per-task reliabilities. That formulation is new relative to the cited baselines and avoids needing any gold expertise labels.
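A small sketch of the multi-task variant, assuming the expertise matrix is indexed by annotator and then by task. The nested-dict layout, the annotator and task names, and the numeric values are hypothetical; in the method the entries would be learned rather than set by hand.

```python
def mixture_prob_multitask(model_probs, label, expertise_matrix, annotator, task):
    """Mixture probability where the weight depends on both annotator and task.

    expertise_matrix[annotator][task] is a weight in [0, 1] (hypothetical layout).
    """
    w = expertise_matrix[annotator][task]
    return w * model_probs[label] + (1.0 - w) / len(model_probs)

# Hypothetical annotators and tasks, with hand-set values for illustration.
expertise_matrix = {
    "ann_a": {"science_qa": 0.9, "medical_qa": 0.2},
    "ann_b": {"science_qa": 0.3, "medical_qa": 0.8},
}
probs = [0.6, 0.2, 0.2]
p_science = mixture_prob_multitask(probs, 0, expertise_matrix, "ann_a", "science_qa")
p_medical = mixture_prob_multitask(probs, 0, expertise_matrix, "ann_a", "medical_qa")
print(p_science, p_medical)
```

The same annotator and the same model prediction yield different label likelihoods depending on the task, which is what lets an annotator be trusted in one domain and discounted in another.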

Referee Report

2 major / 2 minor

Summary. The paper proposes REALM, which jointly optimizes language model parameters and per-annotator expertise scalars (or an expertise matrix in the multi-task extension) by modeling each observed label as an expertise-weighted mixture of the model's current prediction and a uniform random distribution. It evaluates the approach on five question-answering benchmarks by fine-tuning Flan-T5 models of three sizes under simulated noisy annotations. It reports outperformance over naive noisy supervised fine-tuning in the large majority of settings, with accuracy improvements of up to 50% in the most adversarial regime and gains that grow with model capacity.

Significance. If the gains hold under realistic annotator noise, REALM would provide a practical unsupervised way to leverage heterogeneous crowd annotations without discarding annotator identity or requiring gold-standard expertise labels. The joint optimization and multi-task matrix extension are conceptually clean, and the reported scaling of gains with model size is a positive signal worth further investigation.

major comments (2)

[§5] §5 (Experiments): All reported results use noise generated exactly from the mixture p(label) = expertise * model_prediction + (1-expertise) * uniform that REALM assumes during training. This matched generative process allows expertise recovery by construction and does not test the method under realistic misspecification such as class-conditional confusions or position biases typical of crowd annotations.
[§4.1] §4.1 (Method): The single scalar expertise per annotator is assumed constant across examples and tasks (except in the matrix extension); no analysis shows whether the learned scalars remain stable or become entangled with model errors as training progresses, which is load-bearing for the claim that expertise can be recovered unsupervised throughout fine-tuning.
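The matched-noise setup criticized in the first major comment can be sketched as a sampler. This version substitutes the true label for the model prediction, which is an assumption about how simulated annotators might be generated; the name `sample_matched_label` is likewise illustrative.

```python
import random

def sample_matched_label(true_label, num_options, expertise, rng):
    """With probability `expertise` return the correct option, else a uniform guess.

    Uses the true label in place of the model prediction, which is an
    assumption about how the simulated annotators were generated.
    """
    if rng.random() < expertise:
        return true_label
    return rng.randrange(num_options)

rng = random.Random(0)
for expertise in (1.0, 0.5, 0.0):
    labels = [sample_matched_label(2, 4, expertise, rng) for _ in range(2000)]
    accuracy = labels.count(2) / len(labels)
    # In expectation, accuracy is expertise + (1 - expertise) / 4.
    print(expertise, accuracy)
```

Because labels produced this way follow exactly the functional form that training assumes, recovering the weight is a well-specified estimation problem, which is the substance of the referee's objection.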

minor comments (2)

The abstract and results tables should report the number of random seeds, standard deviations, and any statistical significance tests for the accuracy improvements.
[§5] Clarify in §5 which specific dataset, noise level, and model size produce the 'up to 50%' improvement cited in the abstract.
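The reporting requested in the first minor comment amounts to a per-setting summary across seeds, sketched here. The accuracy values are placeholders for illustration only and are not results from the paper; `summarize_runs` and the method names are hypothetical.

```python
import statistics

def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracy across random seeds."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# Placeholder numbers for illustration only; these are not results from the paper.
method_a_runs = [0.71, 0.69, 0.73]
method_b_runs = [0.52, 0.58, 0.49]
for name, runs in (("method_a", method_a_runs), ("method_b", method_b_runs)):
    mean, std = summarize_runs(runs)
    print(f"{name}: {mean:.3f} ± {std:.3f} over {len(runs)} seeds")
```

Reporting the spread alongside the mean would also help readers judge whether a headline gain exceeds seed-to-seed variation.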

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the work.


Referee: [§5] §5 (Experiments): All reported results use noise generated exactly from the mixture p(label) = expertise * model_prediction + (1-expertise) * uniform that REALM assumes during training. This matched generative process allows expertise recovery by construction and does not test the method under realistic misspecification such as class-conditional confusions or position biases typical of crowd annotations.

Authors: We agree that the current experiments rely on noise generated from the exact mixture process assumed by REALM, providing a controlled test of the core mechanism but leaving open questions about robustness to misspecification. In the revised manuscript we will add experiments that simulate class-conditional confusion matrices and position biases drawn from realistic crowd-annotation patterns. These new results will be reported alongside the existing matched-noise results to quantify performance under more varied noise structures.

revision: yes

Referee: [§4.1] §4.1 (Method): The single scalar expertise per annotator is assumed constant across examples and tasks (except in the matrix extension); no analysis shows whether the learned scalars remain stable or become entangled with model errors as training progresses, which is load-bearing for the claim that expertise can be recovered unsupervised throughout fine-tuning.

Authors: The constant scalar per annotator is a modeling choice that keeps the approach fully unsupervised and scalable. While the manuscript does not currently contain explicit tracking of scalar trajectories, the reported accuracy gains that increase with model capacity offer supporting evidence that expertise recovery remains effective. We will add, in the revision, plots of learned expertise scalars over training epochs on multiple datasets together with a short analysis confirming convergence and separation from model-error dynamics.

revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on explicit modeling and external evaluation


The paper introduces REALM via an explicit generative assumption (label as expertise-weighted mixture of current model prediction and uniform) and jointly optimizes model parameters plus per-annotator scalars (or matrix in multi-task). No equation or step equates the reported accuracy gains to the inputs by construction; the mixture is an ansatz, not a tautology, and the performance numbers are obtained by running the optimizer on held-out QA benchmarks under simulated noise. Because the central empirical claims are not forced by re-labeling fitted quantities as predictions and no self-citation chain is load-bearing for the uniqueness of the approach, the derivation remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the mixture model being a faithful representation of annotator behavior and on the optimization successfully recovering meaningful expertise values from noisy labels alone.

free parameters (1)

per-annotator expertise scalars
Learned unsupervised from label data; central to the mixture weighting.

axioms (1)

domain assumption: Annotator reliability is constant across examples and can be summarized by a single scalar
Invoked when defining the mixture weight for each label.



Reference graph

Works this paper leans on

34 extracted references · 5 canonical work pages · 2 internal anchors

[1] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[2] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024.
[3] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” arXiv preprint arXiv:1909.08593, 2019.
[4] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.
[5] A. P. Dawid and A. M. Skene, “Maximum likelihood estimation of observer error-rates using the EM algorithm,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 28, no. 1, pp. 20–28, 1979.
[6] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, “Learning from crowds,” Journal of machine learning research, vol. 11, no. 4, 2010.
[7] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy, “LIMA: Less is more for alignment,” in Advances in Neural Information Processing Systems, vol. 36, pp. 55006–55021, 2023.
[8] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” Advances in neural information processing systems, vol. 26, 2013.
[9] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” Advances in neural information processing systems, vol. 31, 2018.
[10] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1944–1952, 2017.
[11] M. Beliaev, A. Shih, S. Ermon, D. Sadigh, and R. Pedarsani, “Imitation learning by estimating expertise of demonstrators,” in International Conference on Machine Learning, pp. 1732–1748, PMLR, 2022.
[12] M. Beliaev and R. Pedarsani, “Inverse reinforcement learning by estimating expertise of demonstrators,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 15532–15540, 2025.
[13] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? A new dataset for open book question answering,” in Proceedings of EMNLP, 2018.
[14] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? Try ARC, the AI2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.
[15] Y. Bisk, R. Zellers, J. Gao, and Y. Choi, “PIQA: Reasoning about physical commonsense in natural language,” in Proceedings of AAAI, 2020.
[16] B. Y. Lin, Z. Wu, Y. Yang, D.-H. Lee, and X. Ren, “RiddleSense: Reasoning about riddle questions featuring linguistic creativity and commonsense knowledge,” in Findings of ACL, 2021.
[17] Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “PubMedQA: A dataset for biomedical research question answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 2567–2577, 2019.
[18] J. Whitehill, P. Ruvolo, T. Wu, J. Bergsma, and J. Movellan, “Whose vote should count more: Optimal integration of labels from labelers of unknown expertise,” in Advances in Neural Information Processing Systems, vol. 22, pp. 2035–2043, 2009.
[19] A. M. Davani, M. Díaz, and V. Prabhakaran, “Dealing with disagreements: Looking beyond the majority vote in subjective annotations,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 92–110, 2022.
[20] M. L. Gordon, M. S. Lam, J. S. Park, K. Patel, J. Hancock, T. Hashimoto, and M. S. Bernstein, “Jury learning: Integrating dissenting voices into machine learning models,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–19, 2022.
[21] B. Plank, “The “problem” of human label variation: On ground truth in data, modeling and evaluation,” in Proceedings of the 2022 conference on empirical methods in natural language processing, pp. 10671–10682, 2022.
[22] A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, and M. Poesio, “Learning from disagreement: A survey,” Journal of Artificial Intelligence Research, vol. 72, pp. 1385–1470, 2021.
[23] J. Luo, X. Luo, and K. Ding, “RobustFT: Robust supervised fine-tuning for large language models under noisy response,” arXiv preprint arXiv:2412.14922, 2024.
[24] S. Wang, Z. Tan, and R. Guo, “Noise-robust fine-tuning of pretrained language models via external guidance,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
[25] A. K. Yadav and A. Singh, “SymNoise: Advancing language model fine-tuning with symmetric noise,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2023.
[26] Y. Yu, S. Zuo, and H. Jiang, “Fine-tuning pre-trained language model with weak supervision: A contrastive-regularized self-training approach,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, 2021.
[27] D. A. Pomerleau, “Efficient training of artificial neural networks for autonomous navigation,” Neural computation, vol. 3, no. 1, pp. 88–97, 1991.
[28] S. Zhang, Z. Cao, D. Sadigh, and Y. Sui, “Confidence-aware imitation learning from demonstrations with varying optimality,” Advances in Neural Information Processing Systems, vol. 34, pp. 12340–12350, 2021.
[29] M. Yang, S. Levine, and O. Nachum, “Trail: Near-optimal imitation learning with suboptimal data,” arXiv preprint arXiv:2110.14770, 2021.
[30] D. S. Brown, W. Goo, and S. Niekum, “Better-than-demonstrator imitation learning via automatically-ranked demonstrations,” in Conference on robot learning, pp. 330–359, PMLR, 2020.
[31] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with self-generated instructions,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp. 13484–13508, 2023.
[32] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,” Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020.
[33] S. Ghiasvand, Y. Yang, Z. Xue, M. Alizadeh, Z. Zhang, and R. Pedarsani, “Communication-efficient and tensorized federated fine-tuning of large language models,” in Findings of the Association for Computational Linguistics: ACL 2025, pp. 24192–24207, 2025.
[34] S. Ghiasvand, M. Alizadeh, and R. Pedarsani, “Decentralized low-rank fine-tuning of large language models,” in Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pp. 334–345, 2025.

A Additional Results. A.1 Results under Dist. ② and Dist. ③. Tables 6 and 7 report results under the remaining two expertise distributions. Un...
