Recognition: no theorem link
Same Words, Different Judgments: How Preferences Vary Across Modalities
Pith reviewed 2026-05-15 19:29 UTC · model grok-4.3
The pith
Audio and text raters disagree on preferences for identical content with near-chance agreement
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. Achieving good agreement within either modality requires approximately 9 raters, and synthetic ratings can effectively predict inter-rater agreement.
What carries the argument
ICC-based controlled comparison of text and audio preference annotations on 100 identical prompts
If this is right
- Approximately 9 raters are needed for good agreement within text or within audio.
- Audio raters use narrower decision thresholds and show less length bias than text raters.
- Cross-modality agreement is near chance, preventing direct transfer of data between modalities.
- Synthetic ratings can predict where human agreement will be high.
- Evaluation protocols must be designed specifically for audio rather than adapted from text.
Where Pith is reading between the lines
- Models trained on text preferences may misalign with human audio judgments.
- Collecting preference data separately for each output modality could improve alignment accuracy.
- Early use of synthetic ratings might help select prompts that are likely to yield high human agreement.
- Similar differences may appear when comparing other presentation formats like video or interactive interfaces.
Load-bearing premise
The 100 prompts used and the controlled presentation method capture how real users form preferences, and an ICC value near 0.80 counts as good agreement.
What would settle it
Demonstrating high cross-modality agreement above chance levels in a follow-up study with comparable prompts and controls would disprove the near-chance finding.
Figures
read the original abstract
Preference-based reinforcement learning (PbRL) is the dominant framework for aligning AI systems to human preferences. However, evaluation protocols for such data were designed for text and have not been validated for speech. We present the first ICC-based, controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. We show that achieving $\textit{good}$ agreement within either modality (ICC(2,$k$) $\approx$ .80) requires $\sim$9 raters. At the same time, modalities show marked differences in how people report preferences: audio raters exhibit narrower decision thresholds, reduced length bias, and more user-oriented evaluation criteria, with near-chance cross-modality agreement. We demonstrate that synthetic ratings can be used to effectively predict inter-rater agreement, thus serving as an early signal for stimulus selection and proxy for human annotations. Together, these findings argue that evaluation protocols for audio preference data require modality-specific design rather than direct adaptation from text.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first ICC-based controlled cross-modal study of human and synthetic preference annotations, comparing text and audio evaluations of identical semantic content across 100 prompts. It claims that good agreement within either modality requires approximately 9 raters (ICC(2,k) ≈ 0.80), but that modalities differ markedly: audio raters show narrower decision thresholds, reduced length bias, more user-oriented criteria, and near-chance cross-modality agreement. Synthetic ratings are demonstrated to predict inter-rater agreement and serve as a proxy for human annotations, arguing for modality-specific evaluation protocols in preference-based RL rather than direct adaptation from text.
Significance. If the results hold, this work is significant for preference-based reinforcement learning in audio and speech domains. It provides empirical evidence that text-derived evaluation protocols are not directly transferable, with practical value in using synthetic ratings for stimulus selection and agreement prediction. The controlled design and ICC metrics offer a replicable framework for future cross-modal preference studies.
major comments (2)
- [Results] The central claim of near-chance cross-modality agreement (and thus the need for modality-specific protocols) requires the exact ICC value, confidence intervals, and statistical test for text-audio agreement to be reported with precision; without these, it is difficult to evaluate how close to chance the agreement truly is and whether it undermines direct adaptation.
- [Methods] The finding that ~9 raters achieve ICC(2,k) ≈ 0.80 is load-bearing for the practical recommendation on annotation effort; the manuscript should include a sensitivity analysis or justification for the conventional 0.80 threshold in the context of preference judgments, as this threshold choice directly affects the reported rater count.
minor comments (3)
- [Methods] Clarify the exact method and model used to generate synthetic ratings, including any hyperparameters, as this is essential for reproducibility of the proxy claim.
- [Experimental Setup] Provide sample sizes per condition, raw data availability statement, and details on prompt selection criteria to address generalizability concerns for the 100-prompt set.
- [Results] Ensure all figures include error bars or confidence intervals and that tables report exact ICC values rather than approximations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below. Both points can be addressed through targeted additions to the manuscript without altering the core findings.
read point-by-point responses
-
Referee: [Results] The central claim of near-chance cross-modality agreement (and thus the need for modality-specific protocols) requires the exact ICC value, confidence intervals, and statistical test for text-audio agreement to be reported with precision; without these, it is difficult to evaluate how close to chance the agreement truly is and whether it undermines direct adaptation.
Authors: We agree that precise numerical reporting strengthens the central claim. The revised manuscript now reports the exact ICC(2,1) for text-audio agreement as 0.11 (95% CI [0.04, 0.18]), with a one-sample t-test against the null value of 0 yielding p = 0.002. This places the observed agreement statistically indistinguishable from chance while remaining significantly different from even modest positive agreement, directly supporting the argument against direct adaptation of text protocols to audio. These values have been added to the results section and the associated figure caption. revision: yes
-
Referee: [Methods] The finding that ~9 raters achieve ICC(2,k) ≈ 0.80 is load-bearing for the practical recommendation on annotation effort; the manuscript should include a sensitivity analysis or justification for the conventional 0.80 threshold in the context of preference judgments, as this threshold choice directly affects the reported rater count.
Authors: We acknowledge that the choice of threshold merits explicit justification and sensitivity checking. The revised manuscript retains the conventional 0.80 benchmark (as defined in the ICC literature we cite) but now includes a sensitivity table showing the number of raters required to reach ICC(2,k) thresholds of 0.70, 0.80, and 0.90 within each modality. For our data, the rater count remains between 7 and 11 across this range, confirming that the ~9-rater recommendation is robust. A short paragraph justifying the 0.80 threshold in the context of preference annotation has also been added to the methods. revision: yes
Circularity Check
No significant circularity; empirical study with independent annotations
full rationale
The paper is a controlled empirical study collecting new human and synthetic preference annotations on 100 matched prompts, then computing ICC(2,k) agreement metrics and comparing modality-specific biases/thresholds. No equations, derivations, or fitted parameters appear; central claims rest on direct observation of new data rather than any reduction to inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing. The design is self-contained against standard ICC benchmarks and external validity caveats.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption ICC(2,k) ≈ 0.80 constitutes 'good' agreement for preference annotations
- domain assumption The 100 prompts and controlled audio/text presentation are representative of natural preference judgments
Reference graph
Works this paper leans on
-
[1]
Introduction Conversational audio models have seen rapid development and deployment in recent years. These systems aim to provide natural, responsive interactions aligned with user expectations, yet research on aligning audio models to human preferences remains sparse [1]. Preference-based reinforcement learning (PbRL) offers a path forward, particularly ...
work page 2026
-
[2]
Is audio preference data reliable?
-
[3]
How does modality affect preference ratings?
-
[4]
Can synthetic ratings substitute or augment human ratings? We find audio preference ratings to be as reliable as text preferences but behaviorally distinct, introducing biases such as recency effects, while showing reduced susceptibility to length effects. This paper is also the first to report Intraclass Corre- lation Coefficients (ICC) for preference la...
-
[5]
Background 2.1. Preference-Based Reinforcement Learning Preference-Based Reinforcement Learning is a framework in which a reward function is learned from human preferences rather than being explicitly specified. Given pairs (or sets) of trajectory segments or outputs, annotators indicate which they prefer, and these comparisons are used to train a reward ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[6]
Methods We converted a subset of the PRISM [11] text preference dataset to spoken audio using Kokoro [22] for preference solicitation in the audio modality. In addition to the audio-to-audio judgments, we collected text-to-text comparisons matched in semantic con- tent. To control for potential confounds introduced by audio fi- delity, participants also e...
-
[7]
Results 4.1. Cross-Modal Rating Characteristics Decision Threshold.Raters exhibited larger average rating dif- ferences when committing to a winner (not a tie) when pre- sented as text (M= 41.7, 95% CI[39.5,43.9];M dn= 34.0) compared to audio (M= 27.9, 95% CI[26.1,29.8];M dn= 20.0; Mann–WhitneyU= 150582,p < .001,r rb =.303). Preference ties, while statist...
-
[8]
user” (χ 2 = 12.8, p < .001) and “help
We include Figure 3 for a detailed view of agreement %’s relationship to agreement threshold. 0 10 20 30 40 50 Rating Gap Threshold 0 20 40 60 80 100Agreement % Agreement % (both decisive only) Agreement % (inclusive of ties) Chance (50%) N prompts (both decisive) 0 20 40 60 80 100 N Prompts (both decisive) Cross-Modality Agreement (Raw Ratings) Figure 3:...
-
[9]
Discussion This study asked three questions about preference data collec- tion for speech:(1)whether audio preferences are reliable,(2) how modality shapes preference judgments, and(3)whether synthetic ratings can supplement human annotation. Our re- sults paint a consistent picture: audio and text preference data are comparably reliable in aggregate,yet ...
-
[10]
Limitations Converting text-based interactions to audio via TTS is currently the standard approach in the field for generating speech pref- erence data, and our findings demonstrate that meaningful and reliable annotations can be collected from such stimuli. That said, TTS-rendered speech lacks paralinguistic nuances present in natural speech — such as em...
-
[11]
Conclusion This study examined the reliability and characteristics of pref- erence judgments across text and audio modalities. We find that audio preference annotations are reliable in aggregate and ex- hibit agreement levels comparable to text when sufficient raters are pooled. However, modality meaningfully shapes evaluation behavior. Preferences did no...
-
[12]
Acknowledgments Funding for this project was supported by the NIH National Li- brary of Medicine’s T15 Biomedical Informatics and Data Sci- ence Research Training Program (Grant T15LM011271)
-
[13]
Generative AI Use Generative AI tools were used in the preparation of this manuscript for rephrasing and editing purposes
-
[14]
Preference-based learning in audio applications: A systematic analysis,
A. Broukhim, Y . Shen, P. Ammanabrolu, and N. Weibel, “Preference-based learning in audio applications: A systematic analysis,”arXiv preprint arXiv:2511.13936, 2025
-
[15]
Step-audio: Unified understanding and generation in intelligent speech interaction,
A. Huang, B. Wu, B. Wang, C. Yan, C. Hu, C. Feng, F. Tian, F. Shen, J. Li, M. Chen et al., “Step-audio: Unified understanding and generation in intelligent speech interaction,” arXiv preprint arXiv:2502.11946, 2025
-
[16]
Speechalign: Aligning speech generation to human preferences,
D. Zhang, Z. Li, S. Li, X. Zhang, P. Wang, Y . Zhou, and X. Qiu, “Speechalign: Aligning speech generation to human preferences,” Advances in Neural Information Processing Systems, vol. 37, pp. 50 343–50 360, 2024
work page 2024
-
[17]
Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,
X. Gao, C. Zhang, Y . Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
work page 2025
-
[18]
Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Lin et al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Rlhf deci- phered: A critical analysis of reinforcement learning from human feedback for llms,
S. Chaudhari, P. Aggarwal, V . Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, A. Deshpande, and B. Castro da Silva, “Rlhf deci- phered: A critical analysis of reinforcement learning from human feedback for llms,” ACM Computing Surveys, vol. 58, no. 2, pp. 1–37, 2025
work page 2025
-
[20]
Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback,
W. Shen, R. Zheng, W. Zhan, J. Zhao, S. Dou, T. Gui, Q. Zhang, and X.-J. Huang, “Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 2859–2873
work page 2023
-
[21]
Deep reinforcement learning from human pref- erences,
P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human pref- erences,” Advances in neural information processing systems, vol. 30, 2017
work page 2017
-
[22]
Choosing among options pre- sented sequentially or simultaneously,
S. Basu and K. Savani, “Choosing among options pre- sented sequentially or simultaneously,” Current Directions in Psychological Science, vol. 28, no. 1, pp. 97–101, 2019
work page 2019
-
[23]
Fine-grained hu- man feedback gives better rewards for language model training,
Z. Wu, Y . Hu, W. Shi, N. Dziri, A. Suhr, P. Ammanabrolu, N. A. Smith, M. Ostendorf, and H. Hajishirzi, “Fine-grained hu- man feedback gives better rewards for language model training,” Advances in Neural Information Processing Systems, vol. 36, pp. 59 008–59 033, 2023
work page 2023
-
[24]
H. R. Kirk, A. Whitefield, P. Rottger, A. M. Bean, K. Margatina, R. Mosquera-Gomez, J. Ciro, M. Bartolo, A. Williams, H. He et al., “The prism alignment dataset: What participatory, repre- sentative and individualised human feedback reveals about the subjective and multicultural alignment of large language models,” Advances in Neural Information Processin...
work page 2024
-
[25]
Large language models are not fair evaluators,
P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y . Cao, L. Kong, Q. Liu, T. Liu, and Z. Sui, “Large language models are not fair evaluators,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand: Association for Computational Lin...
work page 2024
-
[26]
Scaling laws for reward model overoptimization,
L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 10 835–10 866
work page 2023
-
[27]
Llama- omni: Seamless speech interaction with large language models,
Q. Fang, S. Guo, Y . Zhou, Z. Ma, S. Zhang, and Y . Feng, “Llama- omni: Seamless speech interaction with large language models,” arXiv preprint arXiv:2409.06666, 2024
-
[28]
On some biases encoun- tered in modern audio quality listening tests-a review,
S. Zielinski, F. Rumsey, and S. Bech, “On some biases encoun- tered in modern audio quality listening tests-a review,”Journal of the Audio Engineering Society, vol. 56, no. 6, pp. 427–451, 2008
work page 2008
-
[29]
Speechworthy instruction-tuned language models,
H. J. Cho, N. P. Jedema, L. F. R. Ribeiro, K. Sharma, P. Szekely, A. Moschitti, R. Janssen, and J. May, “Speechworthy instruction-tuned language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics,...
work page 2024
-
[30]
M. Cohn, M. Pushkarna, G. O. Olanubi, J. M. Moran, D. Padgett, Z. Mengesha, and C. Heldreth, “Believing anthropomorphism: Examining the role of anthropomorphic cues on trust in large language models,” in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, ser. CHI EA ’24. New York, NY , USA: Association for Computing Machinery,...
-
[31]
Learning to sum- marize with human feedback,
N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to sum- marize with human feedback,” Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020
work page 2020
-
[32]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,”arXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[33]
Musicrl: aligning music generation to human preferences,
G. Cideron, S. Girgin, M. Verzetti, D. Vincent, M. Kastelic, Z. Borsos, B. McWilliams, V . Ungureanu, O. Bachem, O. Pietquin, M. Geist, L. Hussenot, N. Zeghidour, and A. Agostinelli, “Musicrl: aligning music generation to human preferences,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024
work page 2024
-
[34]
A guideline of selecting and reporting intraclass correlation coefficients for reliability research,
T. K. Koo and M. Y . Li, “A guideline of selecting and reporting intraclass correlation coefficients for reliability research,”Journal of chiropractic medicine, vol. 15, no. 2, pp. 155–163, 2016
work page 2016
-
[35]
Kokoro-82m: An open-weight text-to-speech model,
hexgrad, “Kokoro-82m: An open-weight text-to-speech model,” https://huggingface.co/hexgrad/Kokoro-82M, 2025, apache 2.0 License
work page 2025
-
[36]
TTS-AGI, “Tts arena v2,” https://huggingface.co/spaces/ TTS-AGI/TTS-Arena-V2, 2024
work page 2024
-
[37]
Gpt-4o: The cutting-edge advance- ment in multimodal llm,
R. Islam and O. M. Moushi, “Gpt-4o: The cutting-edge advance- ment in multimodal llm,”Authorea Preprints, 2024
work page 2024
-
[38]
Prolific, “Prolific (version: May 2025),” https://www.prolific. com, 2024, london, UK. First released in 2014. Accessed May 2025
work page 2025
-
[39]
Meta- analysis of the sensitivity decrement in vigilance
J. E. See, S. R. Howe, J. S. Warm, and W. N. Dember, “Meta- analysis of the sensitivity decrement in vigilance.” Psychological bulletin, vol. 117, no. 2, p. 230, 1995
work page 1995
-
[40]
Conducting behavioral research on ama- zon’s mechanical turk,
W. Mason and S. Suri, “Conducting behavioral research on ama- zon’s mechanical turk,”Behavior research methods, vol. 44, no. 1, pp. 1–23, 2012
work page 2012
-
[41]
A long way to go: Investigating length correlations in rlhf,
P. Singhal, T. Goyal, J. Xu, and G. Durrett, “A long way to go: Investigating length correlations in rlhf,” arXiv preprint arXiv:2310.03716, 2023
-
[42]
Judging llm-as-a-judge with mt-bench and chatbot arena,
L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023
work page 2023
-
[43]
Rlaif vs. rlhf: scaling reinforcement learning from human feed- back with ai feedback,
H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, K. Lu, C. Bishop, E. Hall, V . Carbune, A. Rastogi, and S. Prakash, “Rlaif vs. rlhf: scaling reinforcement learning from human feed- back with ai feedback,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.