The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Anamaria Crisan; Jimmy Lin; Pouya Sadeghi

arxiv: 2605.28966 · v1 · pith:XQN3NMEPnew · submitted 2026-05-27 · 💻 cs.CL · cs.HC

The Trust Paradox: How CS Researchers Engage LLM Leaderboards

Pouya Sadeghi , Anamaria Crisan , Jimmy Lin This is my paper

Pith reviewed 2026-06-29 12:47 UTC · model grok-4.3

classification 💻 cs.CL cs.HC

keywords LLM leaderboardstrust paradoxpragmatic skepticismmodel selectionsubfield differencesevaluation infrastructurethematic analysisAI research practices

0 comments

The pith

CS researchers distrust LLM leaderboards yet rely on them as rough decision aids

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how computer science researchers actually interact with LLM leaderboards that rank models on benchmarks. Through interviews it identifies a consistent pattern of pragmatic skepticism in which participants distrust the rankings but still consult them for choices. Peer networks serve as the main selection tool instead, and researchers favor human-voted arena leaderboards over static ones. Subfield culture creates sharp differences in how much leaderboards matter, while nearly all participants want cost transparency added. These patterns show why leaderboard design needs to match real usage rather than assuming rankings drive decisions alone.

Core claim

The central claim is a near-universal paradox of pragmatic skepticism: CS researchers express deep distrust of leaderboard rankings but continue to use them as rough decision-making aids. Peer networks function as the primary model selection mechanism, arena-based leaderboards are preferred over static benchmarks, and influence varies sharply by subfield due to disciplinary culture. Cost transparency stands out as the most requested missing feature across participants.

What carries the argument

Reflexive thematic analysis of semi-structured interviews that surfaces the trust paradox in leaderboard engagement

If this is right

Leaderboard platforms should add cost data and task-specific breakdowns to match how researchers actually consult them.
Arena formats with voter demographics should be expanded since they are consistently preferred.
Design changes must account for subfield differences rather than assuming uniform engagement.
Peer networks should be treated as a core part of model selection infrastructure alongside public rankings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Making cost and task details visible could narrow the gap between stated distrust and continued use.
The same pragmatic skepticism pattern may appear in other ranking systems used for research decisions.
Disciplinary culture as a mediator suggests targeted studies within individual subfields would reveal more nuance.

Load-bearing premise

Interviews with eight researchers across four subfields are enough to detect a near-universal pattern of pragmatic skepticism and subfield differences.

What would settle it

A larger survey or direct observation of researcher behavior showing no consistent distrust-use gap or no subfield variation would undermine the central claim.

Figures

Figures reproduced from arXiv: 2605.28966 by Anamaria Crisan, Jimmy Lin, Pouya Sadeghi.

**Figure 2.** Figure 2: Cross-subfield consensus on codes directly relevant to each research question. Pie charts show within-subfield agree [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The trust paradox: participants simultaneously ex [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

read the original abstract

Large language model (LLM) leaderboards rank AI models using standardized benchmarks and have become highly visible across computer science, despite known limitations in their reliability and robustness. Yet how they shape researchers' actual practice remains empirically uncharted. We address this gap through semi-structured interviews with eight researchers across four computer science subfields, analyzed using reflexive thematic analysis. We find a near-universal paradox of pragmatic skepticism: while participants expressed deep distrust of leaderboard rankings, they continued to use them as rough decision-making aids. Peer networks, not leaderboards, emerged as the primary model selection mechanism, and arena-based (human-voting) leaderboards were consistently preferred over static benchmark leaderboards. Leaderboard influence varied sharply across subfields, revealing that disciplinary culture, not individual attitudes, mediates engagement; for instance, NLP researchers faced state-of-the-art comparison pressure while HCI and Systems/Privacy researchers reported none. Across these differences, however, participants converged on cost transparency as the most demanded missing feature (seven of eight). We translate these findings into concrete design recommendations that align evaluation infrastructure with how researchers actually use it, such as task-specific score breakdowns, cost integration, and voter-demographic disclosure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Eight interviews document researchers distrusting LLM leaderboards yet still using them, plus subfield differences and a cost transparency demand, but the tiny sample undercuts the universality and culture claims.

read the letter

This paper is a qualitative interview study with eight CS researchers from four subfields. It reports that participants expressed distrust in leaderboard rankings but kept using them as rough guides anyway, turned more to peer networks for actual model choices, preferred arena-style human-voting boards, and converged on wanting cost transparency (seven of eight). Subfield differences appear in the pressure felt, with NLP researchers feeling more of it than those in HCI or Systems.

The new part is the interview data itself on actual day-to-day engagement, which the abstract says was uncharted. The work does a decent job pulling out practitioner attitudes and turning them into design suggestions like task-specific breakdowns and voter demographics.

The soft spot is the sample. Eight people, roughly two per subfield, is too narrow to back a "near-universal" paradox or to treat observed differences as disciplinary culture rather than individual variation. The only hard count is the cost item; everything else rests on thematic interpretation of a small, non-random set. That matches the stress-test concern exactly.

The paper is for people who build or study evaluation tools and want some grounded input on how researchers actually behave. A reader focused on meta-research or leaderboard design could pick up useful angles, but it is not positioned to shift broader evaluation practices.

It deserves peer review so the sampling, coding process, and scope of the claims can be checked, but the authors should add clear limits on what the data can support.

Referee Report

2 major / 2 minor

Summary. The paper reports on semi-structured interviews with eight CS researchers across four subfields (NLP, HCI, Systems, Privacy), analyzed via reflexive thematic analysis. It claims a near-universal 'pragmatic skepticism' paradox in which participants distrust LLM leaderboards yet continue to use them as rough aids; peer networks are the primary model-selection mechanism; arena-based leaderboards are preferred over static benchmarks; subfield differences exist (e.g., NLP faces SOTA pressure while HCI/Systems report none); and cost transparency is the most demanded missing feature (7/8 participants). The authors translate these into design recommendations such as task-specific breakdowns, cost integration, and voter-demographic disclosure.

Significance. If the reported patterns are robust, the work supplies the first empirical account of how CS researchers actually engage with LLM leaderboards, documenting a gap between stated distrust and continued use and identifying concrete feature requests that could improve evaluation infrastructure. The emphasis on disciplinary culture as a mediator and the translation into actionable recommendations are strengths that could inform both leaderboard operators and future qualitative studies of research practice.

major comments (2)

[Methods and Results] Methods and Results sections: The central claims of a 'near-universal' pragmatic skepticism and of subfield differences mediated by disciplinary culture rest on a purposive sample of only eight participants (two per subfield in the four-subfield design). Reflexive thematic analysis can surface candidate themes, but the small non-random sample cannot securely distinguish prevalence, universality, or culture-level effects from individual variation or sampling artifacts; only the cost-transparency pattern is quantified (7/8), while the paradox and subfield claims rely on qualitative interpretation.
[Abstract and §4] Abstract and §4 (Findings): The manuscript states that 'disciplinary culture, not individual attitudes, mediates engagement' yet provides no explicit comparison of within- versus between-subfield variation or any saturation or prevalence metrics beyond the single 7/8 count. This weakens the load-bearing assertion that observed differences reflect subfield norms rather than the limited sample.

minor comments (2)

[Methods] Methods: Additional detail on interview protocol, sampling strategy, coding process, and any inter-rater or member-checking procedures would strengthen transparency even for a reflexive thematic analysis.
[Abstract] Abstract: The abstract could briefly note the small sample size to calibrate reader expectations for generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback highlighting important limitations in our sample size and the strength of our claims. We agree that the small sample precludes strong claims of universality or definitive subfield cultural effects, and we will revise the manuscript accordingly to present our findings more cautiously as patterns observed in this sample.

read point-by-point responses

Referee: [Methods and Results] Methods and Results sections: The central claims of a 'near-universal' pragmatic skepticism and of subfield differences mediated by disciplinary culture rest on a purposive sample of only eight participants (two per subfield in the four-subfield design). Reflexive thematic analysis can surface candidate themes, but the small non-random sample cannot securely distinguish prevalence, universality, or culture-level effects from individual variation or sampling artifacts; only the cost-transparency pattern is quantified (7/8), while the paradox and subfield claims rely on qualitative interpretation.

Authors: We fully agree with this assessment. Reflexive thematic analysis is appropriate for generating insights from small samples but does not support claims of universality or prevalence. The term 'near-universal' was used to indicate that the pragmatic skepticism theme appeared across all participants in our sample, but we recognize this language implies broader applicability than the data warrant. We will revise the manuscript to remove or qualify such language, explicitly noting the sample size as a limitation and framing findings as candidate themes from a purposive sample of eight researchers. No prevalence metrics beyond the 7/8 for cost will be added, as they are not appropriate for this method. revision: yes
Referee: [Abstract and §4] Abstract and §4 (Findings): The manuscript states that 'disciplinary culture, not individual attitudes, mediates engagement' yet provides no explicit comparison of within- versus between-subfield variation or any saturation or prevalence metrics beyond the single 7/8 count. This weakens the load-bearing assertion that observed differences reflect subfield norms rather than the limited sample.

Authors: We acknowledge the lack of explicit within- versus between-subfield comparisons and the absence of saturation metrics. The claim was based on the observed patterns where subfield affiliation aligned with reported pressures (e.g., NLP vs. others), but without systematic variation analysis, it cannot be distinguished from individual differences. We will revise the abstract and §4 to present this as an observed pattern in the sample rather than a mediated effect by disciplinary culture, and add a limitations section discussing the small sample and the need for larger studies to confirm subfield differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical qualitative study with no derivations or fitted predictions

full rationale

The paper reports results from semi-structured interviews with eight CS researchers analyzed via reflexive thematic analysis. No equations, parameters, predictions, or derivation chains exist that could reduce to inputs by construction. Claims about the trust paradox, subfield differences, and design recommendations rest directly on the interview data and thematic coding rather than any self-definitional, fitted-input, or self-citation mechanisms. The work is self-contained as an empirical qualitative study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on standard assumptions of qualitative social science rather than new parameters or entities.

axioms (1)

domain assumption Reflexive thematic analysis produces reliable patterns from semi-structured interview data.
The paper states it used this method to derive the reported themes.

pith-pipeline@v0.9.1-grok · 5739 in / 1124 out tokens · 37196 ms · 2026-06-29T12:47:33.379591+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 24 canonical work pages

[1]

Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Al- rashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M

Norah A. Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Al- rashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M. Saiful Bari, and Haidar Khan. 2024. When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards. InProceedings of the 62nd Annual Meeting o...

work page doi:10.18653/v1/2024.acl-long.744 2024
[2]

Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, The Trust Paradox: How CS Researchers Engage LLM Leaderboard...

2026
[3]

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lam- bert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM Leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard. Accessed: 2026-01-10. You can find the snapshot of this leaderboard at https://web.archive.org/web/202603010...

work page arXiv 2023
[4]

Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psy- chology.Qualitative Research in Psychology3, 2 (2006), 77–101. doi:10.1191/ 1478088706qp063oa

2006
[5]

Virginia Braun and Victoria Clarke. 2019. Reflecting on reflexive thematic analysis.Qualitative Research in Sport, Exercise and Health11, 4 (2019), 589–597. doi:10.1080/2159676X.2019.1628806

work page doi:10.1080/2159676x.2019.1628806 2019
[6]

Virginia Braun and Victoria Clarke. 2021. One size fits all? What counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology18, 3 (2021), 328–352. doi:10.1080/14780887.2020.1769238

work page doi:10.1080/14780887.2020.1769238 2021
[7]

Alexander Campolo. 2025. State-of-the-Art: The Temporal Order of Benchmark- ing Culture.Digital Society4, 2 (2025), 35. doi:10.1007/s44206-025-00190-x

work page doi:10.1007/s44206-025-00190-x 2025
[8]

Gonzalez, and Ion Stoica

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning...

2024
[9]

Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, and Jason Mars. 2025. Ranking Unraveled: Recipes for LLM Rankings in Head-to- Head AI Combat. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Association for Computational Linguistics, Vienna, Austria, 26078–26091. doi:10.18653/v1/2...

work page doi:10.18653/v1/2025.acl-long.1265 2025
[10]

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Ima- geNet: A Large-Scale Hierarchical Image Database. In2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255. doi:10.1109/CVPR.2009. 5206848

work page doi:10.1109/cvpr.2009 2009
[11]

Maria Eriksson, Erasmo Purificato, Arman Noroozian, João Vinagre, Guillaume Chaslot, Emilia Gómez, and David Fernández-Llorca. 2025. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8. 850–864. doi:10.1609/aies.v8i1.36595

work page doi:10.1609/aies.v8i1.36595 2025
[12]

I always assumed that I wasn’t really that close to [her]

Motahhare Eslami, Aimee Rickman, Kristen Vaccaro, Amirhossein Aleyasen, Andy Vuong, Karrie Karahalios, Kevin Hamilton, and Christian Sandvig. 2015. “I always assumed that I wasn’t really that close to [her]”: Reasoning about Invisible Algorithms in News Feeds. InProceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15)....

work page doi:10.1145/2702123.2702556 2015
[13]

Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the Eye of the User: A Critique of NLP Leaderboards. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4846–4853. doi:10.18653/v1/2020.emnlp-main.393

work page doi:10.18653/v1/2020.emnlp-main.393 2020
[14]

Chao Gao, Yandi Shen, and Anderson Y. Zhang. 2023. Uncertainty quantification in the Bradley–Terry–Luce model.Information and Inference: A Journal of the IMA12, 2 (June 2023), 1073–1140. doi:10.1093/imaiai/iaac032

work page doi:10.1093/imaiai/iaac032 2023
[15]

Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, and Arman Cohan. 2025. Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference. InFindings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, Albuquerque, New Mexico, 4605–4629. doi:10.18653/v1/2025.findings-naacl.260

work page doi:10.18653/v1/2025.findings-naacl.260 2025
[16]

Greg Guest, Arwen Bunce, and Laura Johnson. 2006. How Many Interviews Are Enough? An Experiment with Data Saturation and Variability.Field Methods18, 1 (2006), 59–82. doi:10.1177/1525822X05279903

work page doi:10.1177/1525822x05279903 2006
[17]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Un- derstanding. InInternational Conference on Learning Representations. OpenRe- view.net, Online. https://openreview.net/forum?id=d7KBjmI3GmQ

2021
[18]

Huang, Yunyi Shen, Dennis Wei, and Tamara Broderick

Jenny Y. Huang, Yunyi Shen, Dennis Wei, and Tamara Broderick. 2026. Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings. InThe Fourteenth International Conference on Learning Representations. https: //openreview.net/forum?id=jNiEMDsRgc

2026
[19]

Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, and Chiyuan Zhang

Yangsibo Huang, Milad Nasr, Anastasios Nikolas Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, and Chiyuan Zhang
[20]

InProceedings of the 42nd International Conference on Machine Learn- ing (Proceedings of Machine Learning Research, Vol

Exploring and Mitigating Adversarial Manipulation of Voting-Based Leader- boards. InProceedings of the 42nd International Conference on Machine Learn- ing (Proceedings of Machine Learning Research, Vol. 267). PMLR, 25654–25671. https://proceedings.mlr.press/v267/huang25z.html
[21]

Bernard Koch, Emily Denton, Alex Hanna, and Jacob Gates Foster. 2021. Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Vol. 1. https://openreview.net/forum?id=zNQBIBKJRkd

2021
[22]

Manning, Christopher Ré, Diana Acosta-Navas, Drew A

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hong...

2023
[23]

Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt
[24]

InProceedings of the Neural Infor- mation Processing Systems Track on Datasets and Benchmarks, Vol

Are We Learning Yet? A Meta Review of Evaluation Fail- ures Across Machine Learning. InProceedings of the Neural Infor- mation Processing Systems Track on Datasets and Benchmarks, Vol. 1. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/ file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf

2021
[25]

Lipton and Jacob Steinhardt

Zachary C. Lipton and Jacob Steinhardt. 2019. Troubling Trends in Machine Learning Scholarship.ACM Queue17, 1 (2019), 45–77. doi:10.1145/3317287. 3328534

work page doi:10.1145/3317287 2019
[26]

Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, and Shijin Wang. 2025. am-ELO: A Stable Framework for Arena-based LLM Evaluation. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, 38857– 38868. https://proceedings.mlr.press/v267/liu25ak.html

2025
[27]

Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, and Min Lin. 2025. Improving Your Model Ranking on Chatbot Arena by Vote Rigging. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, 44252–44271. https://proceedings.mlr.press/ v267/min25a.html

2025
[28]

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasser- man, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru
[29]

InProceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19)

Model Cards for Model Reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19). Association for Computing Machinery, Atlanta, GA, USA, 220–229. doi:10.1145/3287560.3287596

work page doi:10.1145/3287560.3287596
[30]

Emilee Rader and Rebecca Gray. 2015. Understanding User Beliefs About Algo- rithmic Curation in the Facebook News Feed. InProceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, 173–

2015
[31]

doi:10.1145/2702123.2702174

work page doi:10.1145/2702123.2702174
[32]

Everett M. Rogers. 2003.Diffusion of Innovations(5 ed.). Free Press, New York, NY. https://www.simonandschuster.com/books/Diffusion-of-Innovations-5th- Edition/Everett-M-Rogers/9780743222099

work page arXiv 2003
[33]

Bernstein, Alexander C

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge.International Journal of Computer Vision115, 3 (2015), 211–252. doi:10.1007/s11263-015-0816-y

work page doi:10.1007/s11263-015-0816-y 2015
[34]

Smith, and Oren Etzioni

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM63, 12 (Dec. 2020), 54–63. doi:10.1145/3381831

work page doi:10.1145/3381831 2020
[35]

Herbert A. Simon. 1955. A Behavioral Model of Rational Choice.The Quarterly Journal of Economics69, 1 (1955), 99–118. doi:10.2307/1884852

work page doi:10.2307/1884852 1955
[36]

Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker

Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet Üstun, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker. 2025. The Leaderboard Illusion. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/for...

2025
[37]

Aarohi Srivastava et al . 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research(2023). https://openreview.net/forum?id=uyTL5Bvosj

2023
[38]

Marilyn Strathern. 1997. ‘Improving ratings’: audit in the British University system.European Review5, 3 (1997), 305–321. doi:10.1017/S1062798700002660

work page doi:10.1017/s1062798700002660 1997
[39]

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Association for Computational Linguistics, Florence, Italy, 3645–3650. doi:10.18653/v1/P19- 1355

work page doi:10.18653/v1/p19- 2019
[40]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. InProceedings of the 2018 EMNLP Workshop Pouya Sadeghi, Anamaria Crisan, and Jimmy Lin BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computa...

work page doi:10.18653/v1/w18- 2018
[41]

Tell me a bit about your current research. What are you working on and what excites you about it?

Weichen Wu, Nynke Niezink, and Brian Junker. 2022. A Diagnostic Framework for the Bradley–Terry Model.Journal of the Royal Statistical Society Series A: Statistics in Society185, Supplement 2 (Nov. 2022), S461–S484. doi:10.1111/rssa. 12959 A Interview Guide The interview was structured into eight phases. The numbered items below were treated as anchor que...

work page doi:10.1111/rssa 2022

[1] [1]

Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Al- rashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M

Norah A. Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Al- rashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M. Saiful Bari, and Haidar Khan. 2024. When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards. InProceedings of the 62nd Annual Meeting o...

work page doi:10.18653/v1/2024.acl-long.744 2024

[2] [2]

Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, The Trust Paradox: How CS Researchers Engage LLM Leaderboard...

2026

[3] [3]

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lam- bert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM Leaderboard. https://huggingface.co/spaces/open-llm-leaderboard/ open_llm_leaderboard. Accessed: 2026-01-10. You can find the snapshot of this leaderboard at https://web.archive.org/web/202603010...

work page arXiv 2023

[4] [4]

Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psy- chology.Qualitative Research in Psychology3, 2 (2006), 77–101. doi:10.1191/ 1478088706qp063oa

2006

[5] [5]

Virginia Braun and Victoria Clarke. 2019. Reflecting on reflexive thematic analysis.Qualitative Research in Sport, Exercise and Health11, 4 (2019), 589–597. doi:10.1080/2159676X.2019.1628806

work page doi:10.1080/2159676x.2019.1628806 2019

[6] [6]

Virginia Braun and Victoria Clarke. 2021. One size fits all? What counts as quality practice in (reflexive) thematic analysis?Qualitative Research in Psychology18, 3 (2021), 328–352. doi:10.1080/14780887.2020.1769238

work page doi:10.1080/14780887.2020.1769238 2021

[7] [7]

Alexander Campolo. 2025. State-of-the-Art: The Temporal Order of Benchmark- ing Culture.Digital Society4, 2 (2025), 35. doi:10.1007/s44206-025-00190-x

work page doi:10.1007/s44206-025-00190-x 2025

[8] [8]

Gonzalez, and Ion Stoica

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning...

2024

[9] [9]

Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, and Jason Mars. 2025. Ranking Unraveled: Recipes for LLM Rankings in Head-to- Head AI Combat. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). Association for Computational Linguistics, Vienna, Austria, 26078–26091. doi:10.18653/v1/2...

work page doi:10.18653/v1/2025.acl-long.1265 2025

[10] [10]

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Ima- geNet: A Large-Scale Hierarchical Image Database. In2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255. doi:10.1109/CVPR.2009. 5206848

work page doi:10.1109/cvpr.2009 2009

[11] [11]

Maria Eriksson, Erasmo Purificato, Arman Noroozian, João Vinagre, Guillaume Chaslot, Emilia Gómez, and David Fernández-Llorca. 2025. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8. 850–864. doi:10.1609/aies.v8i1.36595

work page doi:10.1609/aies.v8i1.36595 2025

[12] [12]

I always assumed that I wasn’t really that close to [her]

Motahhare Eslami, Aimee Rickman, Kristen Vaccaro, Amirhossein Aleyasen, Andy Vuong, Karrie Karahalios, Kevin Hamilton, and Christian Sandvig. 2015. “I always assumed that I wasn’t really that close to [her]”: Reasoning about Invisible Algorithms in News Feeds. InProceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15)....

work page doi:10.1145/2702123.2702556 2015

[13] [13]

Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the Eye of the User: A Critique of NLP Leaderboards. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4846–4853. doi:10.18653/v1/2020.emnlp-main.393

work page doi:10.18653/v1/2020.emnlp-main.393 2020

[14] [14]

Chao Gao, Yandi Shen, and Anderson Y. Zhang. 2023. Uncertainty quantification in the Bradley–Terry–Luce model.Information and Inference: A Journal of the IMA12, 2 (June 2023), 1073–1140. doi:10.1093/imaiai/iaac032

work page doi:10.1093/imaiai/iaac032 2023

[15] [15]

Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, and Arman Cohan. 2025. Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference. InFindings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, Albuquerque, New Mexico, 4605–4629. doi:10.18653/v1/2025.findings-naacl.260

work page doi:10.18653/v1/2025.findings-naacl.260 2025

[16] [16]

Greg Guest, Arwen Bunce, and Laura Johnson. 2006. How Many Interviews Are Enough? An Experiment with Data Saturation and Variability.Field Methods18, 1 (2006), 59–82. doi:10.1177/1525822X05279903

work page doi:10.1177/1525822x05279903 2006

[17] [17]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Un- derstanding. InInternational Conference on Learning Representations. OpenRe- view.net, Online. https://openreview.net/forum?id=d7KBjmI3GmQ

2021

[18] [18]

Huang, Yunyi Shen, Dennis Wei, and Tamara Broderick

Jenny Y. Huang, Yunyi Shen, Dennis Wei, and Tamara Broderick. 2026. Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings. InThe Fourteenth International Conference on Learning Representations. https: //openreview.net/forum?id=jNiEMDsRgc

2026

[19] [19]

Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, and Chiyuan Zhang

Yangsibo Huang, Milad Nasr, Anastasios Nikolas Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, and Chiyuan Zhang

[20] [20]

InProceedings of the 42nd International Conference on Machine Learn- ing (Proceedings of Machine Learning Research, Vol

Exploring and Mitigating Adversarial Manipulation of Voting-Based Leader- boards. InProceedings of the 42nd International Conference on Machine Learn- ing (Proceedings of Machine Learning Research, Vol. 267). PMLR, 25654–25671. https://proceedings.mlr.press/v267/huang25z.html

[21] [21]

Bernard Koch, Emily Denton, Alex Hanna, and Jacob Gates Foster. 2021. Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Vol. 1. https://openreview.net/forum?id=zNQBIBKJRkd

2021

[22] [22]

Manning, Christopher Ré, Diana Acosta-Navas, Drew A

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hong...

2023

[23] [23]

Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt

[24] [24]

InProceedings of the Neural Infor- mation Processing Systems Track on Datasets and Benchmarks, Vol

Are We Learning Yet? A Meta Review of Evaluation Fail- ures Across Machine Learning. InProceedings of the Neural Infor- mation Processing Systems Track on Datasets and Benchmarks, Vol. 1. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/ file/757b505cfd34c64c85ca5b5690ee5293-Paper-round2.pdf

2021

[25] [25]

Lipton and Jacob Steinhardt

Zachary C. Lipton and Jacob Steinhardt. 2019. Troubling Trends in Machine Learning Scholarship.ACM Queue17, 1 (2019), 45–77. doi:10.1145/3317287. 3328534

work page doi:10.1145/3317287 2019

[26] [26]

Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, and Shijin Wang. 2025. am-ELO: A Stable Framework for Arena-based LLM Evaluation. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, 38857– 38868. https://proceedings.mlr.press/v267/liu25ak.html

2025

[27] [27]

Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, and Min Lin. 2025. Improving Your Model Ranking on Chatbot Arena by Vote Rigging. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, 44252–44271. https://proceedings.mlr.press/ v267/min25a.html

2025

[28] [28]

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasser- man, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru

[29] [29]

InProceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19)

Model Cards for Model Reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19). Association for Computing Machinery, Atlanta, GA, USA, 220–229. doi:10.1145/3287560.3287596

work page doi:10.1145/3287560.3287596

[30] [30]

Emilee Rader and Rebecca Gray. 2015. Understanding User Beliefs About Algo- rithmic Curation in the Facebook News Feed. InProceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). ACM, 173–

2015

[31] [31]

doi:10.1145/2702123.2702174

work page doi:10.1145/2702123.2702174

[32] [32]

Everett M. Rogers. 2003.Diffusion of Innovations(5 ed.). Free Press, New York, NY. https://www.simonandschuster.com/books/Diffusion-of-Innovations-5th- Edition/Everett-M-Rogers/9780743222099

work page arXiv 2003

[33] [33]

Bernstein, Alexander C

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge.International Journal of Computer Vision115, 3 (2015), 211–252. doi:10.1007/s11263-015-0816-y

work page doi:10.1007/s11263-015-0816-y 2015

[34] [34]

Smith, and Oren Etzioni

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM63, 12 (Dec. 2020), 54–63. doi:10.1145/3381831

work page doi:10.1145/3381831 2020

[35] [35]

Herbert A. Simon. 1955. A Behavioral Model of Rational Choice.The Quarterly Journal of Economics69, 1 (1955), 99–118. doi:10.2307/1884852

work page doi:10.2307/1884852 1955

[36] [36]

Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker

Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet Üstun, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, and Sara Hooker. 2025. The Leaderboard Illusion. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/for...

2025

[37] [37]

Aarohi Srivastava et al . 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research(2023). https://openreview.net/forum?id=uyTL5Bvosj

2023

[38] [38]

Marilyn Strathern. 1997. ‘Improving ratings’: audit in the British University system.European Review5, 3 (1997), 305–321. doi:10.1017/S1062798700002660

work page doi:10.1017/s1062798700002660 1997

[39] [39]

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Association for Computational Linguistics, Florence, Italy, 3645–3650. doi:10.18653/v1/P19- 1355

work page doi:10.18653/v1/p19- 2019

[40] [40]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. InProceedings of the 2018 EMNLP Workshop Pouya Sadeghi, Anamaria Crisan, and Jimmy Lin BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computa...

work page doi:10.18653/v1/w18- 2018

[41] [41]

Tell me a bit about your current research. What are you working on and what excites you about it?

Weichen Wu, Nynke Niezink, and Brian Junker. 2022. A Diagnostic Framework for the Bradley–Terry Model.Journal of the Royal Statistical Society Series A: Statistics in Society185, Supplement 2 (Nov. 2022), S461–S484. doi:10.1111/rssa. 12959 A Interview Guide The interview was structured into eight phases. The numbered items below were treated as anchor que...

work page doi:10.1111/rssa 2022