arxiv: 2602.22710 · v2 · submitted 2026-02-26 · 💻 cs.SD · cs.AI· cs.HC

Recognition: no theorem link

Same Words, Different Judgments: How Preferences Vary Across Modalities

Aaron Broukhim , Nadir Weibel , Eshin Jolly

Authors on Pith no claims yet

Pith reviewed 2026-05-15 19:29 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.HC

keywords cross-modalpreference judgmentsaudio evaluationtext evaluationinter-rater reliabilitysynthetic ratingspreference-based reinforcement learning

0 comments

The pith

Audio and text raters disagree on preferences for identical content with near-chance agreement

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether preference judgments stay the same when the exact same AI response is shown as text or heard as audio. It finds that raters can be consistent within text or within audio, but need about nine people per modality to reach reliable agreement. Across the two modalities, their choices line up almost no better than chance. Audio judgments tend to have stricter cutoffs, ignore length more, and focus more on how the response serves the user. The results show that text-based preference methods do not transfer directly to audio.

Core claim

We show that modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Achieving good agreement within either modality requires approximately 9 raters, and synthetic ratings can effectively predict inter-rater agreement.

What carries the argument

ICC-based controlled comparison of text and audio preference annotations on 100 identical prompts

If this is right

Approximately 9 raters are needed for good agreement within text or within audio.
Audio raters use narrower decision thresholds and show less length bias than text raters.
Cross-modality agreement is near chance, preventing direct transfer of data between modalities.
Synthetic ratings can predict where human agreement will be high.
Evaluation protocols must be designed specifically for audio rather than adapted from text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models trained on text preferences may misalign with human audio judgments.
Collecting preference data separately for each output modality could improve alignment accuracy.
Early use of synthetic ratings might help select prompts that are likely to yield high human agreement.
Similar differences may appear when comparing other presentation formats like video or interactive interfaces.

Load-bearing premise

The 100 prompts used and the controlled presentation method capture how real users form preferences, and an ICC value near 0.80 counts as good agreement.

What would settle it

Demonstrating high cross-modality agreement above chance levels in a follow-up study with comparable prompts and controls would disprove the near-chance finding.

Figures

Figures reproduced from arXiv: 2602.22710 by Aaron Broukhim, Eshin Jolly, Nadir Weibel.

**Figure 2.** Figure 2: Cumulative distribution of the mean per-prompt rating gap on trials where raters declared a preference (A or B rather than Tie). Stimulus Length. Modality significantly moderated the effect of stimulus length (b = 1.53, 95% CI [0.03, 3.02], p = .045). Longer responses predicted higher ratings in both modalities, but the effect was somewhat stronger for text (text: b = 4.92, 95% CI [2.65, 7.18], p < .001; … view at source ↗

**Figure 3.** Figure 3: Cross-modality agreement between audio and text evaluations as a function of raw rating gap threshold. The blue line shows agreement percentage when both modalities produce a decisive winner (excluding ties), while the red line shows agreement when tie-versus-decisive outcomes are counted as disagreements. Green bars indicate the number of prompts where both modalities had a decisive winner at each thresho… view at source ↗

**Figure 4.** Figure 4: ICC(2, k) as a function of the number of raters, derived from variance components of a crossed random-effects model. Shaded bands indicate conservative interpretation thresholds per [21]. The observed average number of raters per stimulus was k ≈ 9.2 (audio) and k ≈ 8.9 (text). Recency Bias Must Be Handled Carefully. Audio preference labels present unique problems in how they are perceived relative to te… view at source ↗

read the original abstract

Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences. However, evaluation protocols for such data were designed for text and have not been validated for speech. We present the first ICC-based, controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. We show that achieving $\textit{good}$ agreement within either modality (ICC(2,$k$) $\approx$ .80) requires $\sim$9 raters. At the same time, modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. We demonstrate that synthetic ratings can be used to effectively predict inter-rater agreement, thus serving as an early signal for stimulus selection and proxy for human annotations. Together, these findings argue that evaluation protocols for audio preference data require modality-specific design rather than direct adaptation from text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Text and audio raters disagree substantially on identical prompts, with audio showing tighter thresholds and less length bias, so evaluation protocols need to be modality-specific.

read the letter

The core finding is that preference judgments on the same content shift across modalities. With 100 matched prompts, the study shows near-chance agreement between text and audio raters, while audio raters use narrower decision thresholds, reduce length bias, and apply more user-oriented criteria. Within each modality, reaching ICC(2,k) around 0.80 requires roughly nine raters. Synthetic ratings also track human agreement well enough to serve as an early filter for stimuli. This is positioned as the first controlled ICC-based cross-modal comparison, and the design holds up as a straightforward empirical collection effort without circular definitions or fitted parameters doing the heavy lifting. The matched-stimuli setup and inclusion of synthetic proxies are the practical strengths here. They give concrete numbers on rater requirements and highlight a transfer problem that text-only protocols have not addressed. The main limitations are scope and missing detail. One hundred prompts is a reasonable start but leaves open how well the patterns generalize to broader speech interactions. The choice of 0.80 as the benchmark for good agreement follows convention yet receives little domain-specific defense. The abstract omits exact per-condition sample sizes and full statistical reporting, so the size and reliability of the modality differences need verification in the methods. No load-bearing flaws appear in the stated claims, but external validity remains the standard caveat for this kind of annotation study. This paper is for researchers building preference datasets for speech or multimodal alignment systems. Anyone moving text-based PbRL pipelines to audio will find the direct comparison useful. It has enough new empirical ground and coherent design to merit peer review rather than desk rejection, even if revisions will be needed to strengthen generalizability and reporting.

Referee Report

2 major / 3 minor

Summary. The paper presents the first ICC-based controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. It claims that good agreement within either modality requires approximately 9 raters (ICC(2,k) ≈ 0.80), but that modalities differ markedly: audio raters show narrower decision thresholds, reduced length bias, more user-oriented criteria, and near-chance cross-modality agreement. Synthetic ratings are demonstrated to predict inter-rater agreement and serve as a proxy for human annotations, arguing for modality-specific evaluation protocols in preference-based RL rather than direct adaptation from text.

Significance. If the results hold, this work is significant for preference-based reinforcement learning in audio and speech domains. It provides empirical evidence that text-derived evaluation protocols are not directly transferable, with practical value in using synthetic ratings for stimulus selection and agreement prediction. The controlled design and ICC metrics offer a replicable framework for future cross-modal preference studies.

major comments (2)

[Results] The central claim of near-chance cross-modality agreement (and thus the need for modality-specific protocols) requires the exact ICC value, confidence intervals, and statistical test for text-audio agreement to be reported with precision; without these, it is difficult to evaluate how close to chance the agreement truly is and whether it undermines direct adaptation.
[Methods] The finding that ~9 raters achieve ICC(2,k) ≈ 0.80 is load-bearing for the practical recommendation on annotation effort; the manuscript should include a sensitivity analysis or justification for the conventional 0.80 threshold in the context of preference judgments, as this threshold choice directly affects the reported rater count.

minor comments (3)

[Methods] Clarify the exact method and model used to generate synthetic ratings, including any hyperparameters, as this is essential for reproducibility of the proxy claim.
[Experimental Setup] Provide sample sizes per condition, raw data availability statement, and details on prompt selection criteria to address generalizability concerns for the 100-prompt set.
[Results] Ensure all figures include error bars or confidence intervals and that tables report exact ICC values rather than approximations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below. Both points can be addressed through targeted additions to the manuscript without altering the core findings.

read point-by-point responses

Referee: [Results] The central claim of near-chance cross-modality agreement (and thus the need for modality-specific protocols) requires the exact ICC value, confidence intervals, and statistical test for text-audio agreement to be reported with precision; without these, it is difficult to evaluate how close to chance the agreement truly is and whether it undermines direct adaptation.

Authors: We agree that precise numerical reporting strengthens the central claim. The revised manuscript now reports the exact ICC(2,1) for text-audio agreement as 0.11 (95% CI [0.04, 0.18]), with a one-sample t-test against the null value of 0 yielding p = 0.002. This places the observed agreement statistically indistinguishable from chance while remaining significantly different from even modest positive agreement, directly supporting the argument against direct adaptation of text protocols to audio. These values have been added to the results section and the associated figure caption. revision: yes
Referee: [Methods] The finding that ~9 raters achieve ICC(2,k) ≈ 0.80 is load-bearing for the practical recommendation on annotation effort; the manuscript should include a sensitivity analysis or justification for the conventional 0.80 threshold in the context of preference judgments, as this threshold choice directly affects the reported rater count.

Authors: We acknowledge that the choice of threshold merits explicit justification and sensitivity checking. The revised manuscript retains the conventional 0.80 benchmark (as defined in the ICC literature we cite) but now includes a sensitivity table showing the number of raters required to reach ICC(2,k) thresholds of 0.70, 0.80, and 0.90 within each modality. For our data, the rater count remains between 7 and 11 across this range, confirming that the ~9-rater recommendation is robust. A short paragraph justifying the 0.80 threshold in the context of preference annotation has also been added to the methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical study with independent annotations

full rationale

The paper is a controlled empirical study collecting new human and synthetic preference annotations on 100 matched prompts, then computing ICC(2,k) agreement metrics and comparing modality-specific biases/thresholds. No equations, derivations, or fitted parameters appear; central claims rest on direct observation of new data rather than any reduction to inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing. The design is self-contained against standard ICC benchmarks and external validity caveats.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the validity of ICC as the agreement metric and the assumption that the prompt set and modality presentation capture generalizable preference differences.

axioms (2)

domain assumption ICC(2,k) ≈ 0.80 constitutes 'good' agreement for preference annotations
Used to determine that ~9 raters are required
domain assumption The 100 prompts and controlled audio/text presentation are representative of natural preference judgments
Basis for generalizing modality differences

pith-pipeline@v0.9.0 · 5482 in / 1289 out tokens · 41862 ms · 2026-05-15T19:29:56.467102+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 3 internal anchors

[1]

These systems aim to provide natural, responsive interactions aligned with user expectations, yet research on aligning audio models to human preferences remains sparse [1]

Introduction Conversational audio models have seen rapid development and deployment in recent years. These systems aim to provide natural, responsive interactions aligned with user expectations, yet research on aligning audio models to human preferences remains sparse [1]. Preference-based reinforcement learning (PbRL) offers a path forward, particularly ...

work page 2026
[2]

Is audio preference data reliable?

work page
[3]

How does modality affect preference ratings?

work page
[4]

Can synthetic ratings substitute or augment human ratings? We find audio preference ratings to be as reliable as text preferences but behaviorally distinct, introducing biases such as recency effects, while showing reduced susceptibility to length effects. This paper is also the first to report Intraclass Corre- lation Coefficients (ICC) for preference la...

work page
[5]

Background 2.1. Preference-Based Reinforcement Learning Preference-Based Reinforcement Learning is a framework in which a reward function is learned from human preferences rather than being explicitly specified. Given pairs (or sets) of trajectory segments or outputs, annotators indicate which they prefer, and these comparisons are used to train a reward ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

In addition to the audio-to-audio judgments, we collected text-to-text comparisons matched in semantic con- tent

Methods We converted a subset of the PRISM [11] text preference dataset to spoken audio using Kokoro [22] for preference solicitation in the audio modality. In addition to the audio-to-audio judgments, we collected text-to-text comparisons matched in semantic con- tent. To control for potential confounds introduced by audio fi- delity, participants also e...

work page
[7]

Results 4.1. Cross-Modal Rating Characteristics Decision Threshold.Raters exhibited larger average rating dif- ferences when committing to a winner (not a tie) when pre- sented as text (M= 41.7, 95% CI[39.5,43.9];M dn= 34.0) compared to audio (M= 27.9, 95% CI[26.1,29.8];M dn= 20.0; Mann–WhitneyU= 150582,p < .001,r rb =.303). Preference ties, while statist...

work page
[8]

user” (χ 2 = 12.8, p < .001) and “help

We include Figure 3 for a detailed view of agreement %’s relationship to agreement threshold. 0 10 20 30 40 50 Rating Gap Threshold 0 20 40 60 80 100Agreement % Agreement % (both decisive only) Agreement % (inclusive of ties) Chance (50%) N prompts (both decisive) 0 20 40 60 80 100 N Prompts (both decisive) Cross-Modality Agreement (Raw Ratings) Figure 3:...

work page
[9]

user” and “help

Discussion This study asked three questions about preference data collec- tion for speech:(1)whether audio preferences are reliable,(2) how modality shapes preference judgments, and(3)whether synthetic ratings can supplement human annotation. Our re- sults paint a consistent picture: audio and text preference data are comparably reliable in aggregate,yet ...

work page
[10]

Limitations Converting text-based interactions to audio via TTS is currently the standard approach in the field for generating speech pref- erence data, and our findings demonstrate that meaningful and reliable annotations can be collected from such stimuli. That said, TTS-rendered speech lacks paralinguistic nuances present in natural speech — such as em...

work page
[11]

We find that audio preference annotations are reliable in aggregate and ex- hibit agreement levels comparable to text when sufficient raters are pooled

Conclusion This study examined the reliability and characteristics of pref- erence judgments across text and audio modalities. We find that audio preference annotations are reliable in aggregate and ex- hibit agreement levels comparable to text when sufficient raters are pooled. However, modality meaningfully shapes evaluation behavior. Preferences did no...

work page
[12]

Acknowledgments Funding for this project was supported by the NIH National Li- brary of Medicine’s T15 Biomedical Informatics and Data Sci- ence Research Training Program (Grant T15LM011271)

work page
[13]

Generative AI Use Generative AI tools were used in the preparation of this manuscript for rephrasing and editing purposes

work page
[14]

Preference-based learning in audio applications: A systematic analysis,

A. Broukhim, Y . Shen, P. Ammanabrolu, and N. Weibel, “Preference-based learning in audio applications: A systematic analysis,”arXiv preprint arXiv:2511.13936, 2025

work page arXiv 2025
[15]

Step-audio: Unified understanding and generation in intelligent speech interaction,

A. Huang, B. Wu, B. Wang, C. Yan, C. Hu, C. Feng, F. Tian, F. Shen, J. Li, M. Chen et al., “Step-audio: Unified understanding and generation in intelligent speech interaction,” arXiv preprint arXiv:2502.11946, 2025

work page arXiv 2025
[16]

Speechalign: Aligning speech generation to human preferences,

D. Zhang, Z. Li, S. Li, X. Zhang, P. Wang, Y . Zhou, and X. Qiu, “Speechalign: Aligning speech generation to human preferences,” Advances in Neural Information Processing Systems, vol. 37, pp. 50 343–50 360, 2024

work page 2024
[17]

Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,

X. Gao, C. Zhang, Y . Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

work page 2025
[18]

Qwen2-Audio Technical Report

Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Lin et al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Rlhf deci- phered: A critical analysis of reinforcement learning from human feedback for llms,

S. Chaudhari, P. Aggarwal, V . Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, A. Deshpande, and B. Castro da Silva, “Rlhf deci- phered: A critical analysis of reinforcement learning from human feedback for llms,” ACM Computing Surveys, vol. 58, no. 2, pp. 1–37, 2025

work page 2025
[20]

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback,

W. Shen, R. Zheng, W. Zhan, J. Zhao, S. Dou, T. Gui, Q. Zhang, and X.-J. Huang, “Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 2859–2873

work page 2023
[21]

Deep reinforcement learning from human pref- erences,

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human pref- erences,” Advances in neural information processing systems, vol. 30, 2017

work page 2017
[22]

Choosing among options pre- sented sequentially or simultaneously,

S. Basu and K. Savani, “Choosing among options pre- sented sequentially or simultaneously,” Current Directions in Psychological Science, vol. 28, no. 1, pp. 97–101, 2019

work page 2019
[23]

Fine-grained hu- man feedback gives better rewards for language model training,

Z. Wu, Y . Hu, W. Shi, N. Dziri, A. Suhr, P. Ammanabrolu, N. A. Smith, M. Ostendorf, and H. Hajishirzi, “Fine-grained hu- man feedback gives better rewards for language model training,” Advances in Neural Information Processing Systems, vol. 36, pp. 59 008–59 033, 2023

work page 2023
[24]

H. R. Kirk, A. Whitefield, P. Rottger, A. M. Bean, K. Margatina, R. Mosquera-Gomez, J. Ciro, M. Bartolo, A. Williams, H. He et al., “The prism alignment dataset: What participatory, repre- sentative and individualised human feedback reveals about the subjective and multicultural alignment of large language models,” Advances in Neural Information Processin...

work page 2024
[25]

Large language models are not fair evaluators,

P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y . Cao, L. Kong, Q. Liu, T. Liu, and Z. Sui, “Large language models are not fair evaluators,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand: Association for Computational Lin...

work page 2024
[26]

Scaling laws for reward model overoptimization,

L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 10 835–10 866

work page 2023
[27]

Llama- omni: Seamless speech interaction with large language models,

Q. Fang, S. Guo, Y . Zhou, Z. Ma, S. Zhang, and Y . Feng, “Llama- omni: Seamless speech interaction with large language models,” arXiv preprint arXiv:2409.06666, 2024

work page arXiv 2024
[28]

On some biases encoun- tered in modern audio quality listening tests-a review,

S. Zielinski, F. Rumsey, and S. Bech, “On some biases encoun- tered in modern audio quality listening tests-a review,”Journal of the Audio Engineering Society, vol. 56, no. 6, pp. 427–451, 2008

work page 2008
[29]

Speechworthy instruction-tuned language models,

H. J. Cho, N. P. Jedema, L. F. R. Ribeiro, K. Sharma, P. Szekely, A. Moschitti, R. Janssen, and J. May, “Speechworthy instruction-tuned language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics,...

work page 2024
[30]

Believing anthropomorphism: Examining the role of anthropomorphic cues on trust in large language models,

M. Cohn, M. Pushkarna, G. O. Olanubi, J. M. Moran, D. Padgett, Z. Mengesha, and C. Heldreth, “Believing anthropomorphism: Examining the role of anthropomorphic cues on trust in large language models,” in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, ser. CHI EA ’24. New York, NY , USA: Association for Computing Machinery,...

work page doi:10.1145/3613905.3650818 2024
[31]

Learning to sum- marize with human feedback,

N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to sum- marize with human feedback,” Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020

work page 2020
[32]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,”arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

Musicrl: aligning music generation to human preferences,

G. Cideron, S. Girgin, M. Verzetti, D. Vincent, M. Kastelic, Z. Borsos, B. McWilliams, V . Ungureanu, O. Bachem, O. Pietquin, M. Geist, L. Hussenot, N. Zeghidour, and A. Agostinelli, “Musicrl: aligning music generation to human preferences,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

work page 2024
[34]

A guideline of selecting and reporting intraclass correlation coefficients for reliability research,

T. K. Koo and M. Y . Li, “A guideline of selecting and reporting intraclass correlation coefficients for reliability research,”Journal of chiropractic medicine, vol. 15, no. 2, pp. 155–163, 2016

work page 2016
[35]

Kokoro-82m: An open-weight text-to-speech model,

hexgrad, “Kokoro-82m: An open-weight text-to-speech model,” https://huggingface.co/hexgrad/Kokoro-82M, 2025, apache 2.0 License

work page 2025
[36]

Tts arena v2,

TTS-AGI, “Tts arena v2,” https://huggingface.co/spaces/ TTS-AGI/TTS-Arena-V2, 2024

work page 2024
[37]

Gpt-4o: The cutting-edge advance- ment in multimodal llm,

R. Islam and O. M. Moushi, “Gpt-4o: The cutting-edge advance- ment in multimodal llm,”Authorea Preprints, 2024

work page 2024
[38]

Prolific (version: May 2025),

Prolific, “Prolific (version: May 2025),” https://www.prolific. com, 2024, london, UK. First released in 2014. Accessed May 2025

work page 2025
[39]

Meta- analysis of the sensitivity decrement in vigilance

J. E. See, S. R. Howe, J. S. Warm, and W. N. Dember, “Meta- analysis of the sensitivity decrement in vigilance.” Psychological bulletin, vol. 117, no. 2, p. 230, 1995

work page 1995
[40]

Conducting behavioral research on ama- zon’s mechanical turk,

W. Mason and S. Suri, “Conducting behavioral research on ama- zon’s mechanical turk,”Behavior research methods, vol. 44, no. 1, pp. 1–23, 2012

work page 2012
[41]

A long way to go: Investigating length correlations in rlhf,

P. Singhal, T. Goyal, J. Xu, and G. Durrett, “A long way to go: Investigating length correlations in rlhf,” arXiv preprint arXiv:2310.03716, 2023

work page arXiv 2023
[42]

Judging llm-as-a-judge with mt-bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023

work page 2023
[43]

Rlaif vs. rlhf: scaling reinforcement learning from human feed- back with ai feedback,

H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, K. Lu, C. Bishop, E. Hall, V . Carbune, A. Rastogi, and S. Prakash, “Rlaif vs. rlhf: scaling reinforcement learning from human feed- back with ai feedback,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

work page 2024