Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Pith reviewed 2026-05-16 11:23 UTC · model grok-4.3
The pith
Practitioners evaluating LLM products gather data they cannot turn into concrete improvements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices and introduce the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. We contribute strategies to bridge this gap, drawn from patterns in successful teams, and argue that interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures.
What carries the argument
The results-actionability gap: the observed disconnect in which evaluation data on LLM products is collected but cannot be converted into specific, actionable product improvements.
If this is right
- LLM product teams rely on a spectrum of practices from informal checks to formal meta-work.
- Successful teams develop organizational processes to link evaluation results to design changes.
- Interpretive practices arise as adaptations to the unpredictable behavior of LLMs.
- HCI research should support systematizing current practices instead of replacing them with new frameworks.
Where Pith is reading between the lines
- Product teams could benefit from lightweight tools that map evaluation signals directly to candidate fixes (a minimal sketch follows this list).
- The gap may lengthen iteration cycles for LLM features in production environments.
- Similar actionability problems could appear when evaluating other non-deterministic AI systems.
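To make the first point above concrete, here is a minimal sketch in Python of what such a signal-to-fix mapping could look like. It is not from the paper: the failure tags, the TAG_TO_FIXES table, and the suggest_fixes helper are hypothetical placeholders for whatever taxonomy and remediation playbook a team actually maintains.

from collections import Counter

# Hypothetical mapping from evaluation failure tags to candidate fixes.
# A real team would derive both the tags and the fixes from its own rubric;
# these values are placeholders for illustration only.
TAG_TO_FIXES = {
    "hallucinated_fact": ["add retrieval grounding", "tighten system prompt"],
    "ignored_instruction": ["add few-shot examples", "split task into steps"],
    "tone_mismatch": ["revise style guidance in prompt"],
    "format_error": ["add output schema validation", "post-process responses"],
}

def suggest_fixes(tagged_failures, top_n=3):
    """Rank candidate fixes by how many tagged failures they would address."""
    counts = Counter()
    for tags in tagged_failures:
        for tag in tags:
            for fix in TAG_TO_FIXES.get(tag, []):
                counts[fix] += 1
    return counts.most_common(top_n)

# Each inner list holds the tags reviewers attached to one failing output.
failures = [
    ["hallucinated_fact", "format_error"],
    ["hallucinated_fact"],
    ["ignored_instruction", "format_error"],
]
print(suggest_fixes(failures))
# e.g. [('add retrieval grounding', 2), ('tighten system prompt', 2), ('add output schema validation', 2)]

Ranking fixes by how many observed failures they touch is one simple way to make evaluation output point at a next action; a real tool would also need to track whether a shipped fix actually moved the corresponding failure rate.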
Load-bearing premise
That the patterns seen in interviews with nineteen practitioners generalize to most LLM product teams and reflect necessary adaptations to LLM traits rather than avoidable team shortcomings.
What would settle it
A larger study that finds most LLM product teams routinely convert evaluation data into specific improvements without special bridging strategies would falsify the gap as a widespread phenomenon.
original abstract
How do product teams evaluate LLM-powered products? As organizations integrate large language models (LLMs) into digital products, their unpredictable nature makes traditional evaluation approaches inadequate, yet little is known about how practitioners navigate this challenge. Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices spanning informal 'vibe checks' to organizational meta-work. Beyond confirming four documented challenges, we introduce a novel fifth we call the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. Drawing on patterns from successful teams, we contribute strategies to bridge this gap, supporting practitioners' formalization journey from ad-hoc interpretive practices (e.g., vibe checks) toward systematic evaluation. Our analysis suggests these interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures. For HCI researchers, this presents a research opportunity to support practitioners in systematizing emerging practices rather than developing new evaluation frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents findings from interviews with 19 practitioners across sectors on how they evaluate LLM-powered products. It identifies ten evaluation practices (from informal 'vibe checks' to organizational meta-work), confirms four previously documented challenges, and introduces a novel 'results-actionability gap' in which teams collect evaluation data but cannot translate it into concrete product improvements. Drawing on patterns from successful teams, it proposes strategies to bridge the gap and frames the observed ad-hoc interpretive practices as necessary adaptations to LLM characteristics rather than methodological shortcomings, while highlighting opportunities for HCI research to support systematization.
Significance. If the findings hold, the work offers timely empirical grounding for the practical realities of LLM product evaluation in industry, documenting a previously unarticulated results-actionability gap and surfacing actionable strategies. This contributes to HCI and SE by shifting focus from new frameworks to supporting the formalization of existing practices, with potential to inform tool design and organizational processes for handling non-deterministic systems.
major comments (3)
- [Methods] Methods section: the description of the interview protocol, participant selection criteria, coding process (including how themes were derived and iterated), and any inter-rater reliability measures is absent or insufficiently detailed. Because the identification of the ten practices and the results-actionability gap rests entirely on this thematic analysis, the lack of these details prevents full assessment of the analysis's rigor and reproducibility.
- [Findings/Discussion] Findings and Discussion sections: the assertion that ad-hoc practices such as vibe checks are 'necessary adaptations to LLM characteristics' rather than 'methodological failures' is presented without distinguishing evidence. No comparative cases (e.g., teams using more systematic methods), external benchmarks, or falsification attempts are provided; the distinction is drawn solely from interpretive patterns in the 19-interview sample, weakening the novelty and framing of the results-actionability gap.
- [Findings] §4 (or equivalent findings section): the generalization that the observed patterns and strategies apply beyond the sample to 'LLM product teams' broadly is not supported by discussion of sample limitations, saturation criteria, or transferability considerations, which is load-bearing for the claim that the gap is a widespread phenomenon requiring specific bridging strategies.
minor comments (2)
- [Abstract] Abstract: the phrase 'our analysis suggests these interpretive practices are necessary adaptations' could be qualified to reflect that this is an interpretive claim from the sample rather than a tested causal finding.
- The manuscript would benefit from a table summarizing the ten practices with brief definitions and example quotes to improve readability and allow readers to map them directly to the results-actionability gap.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the paper's empirical contributions to understanding LLM product evaluation practices. We have addressed each major comment by expanding methodological transparency, strengthening the evidential basis for interpretive claims, and adding explicit discussion of sample limitations and transferability. All revisions will be incorporated in the next manuscript version.
point-by-point responses
-
Referee: [Methods] Methods section: the description of the interview protocol, participant selection criteria, coding process (including how themes were derived and iterated), and any inter-rater reliability measures is absent or insufficiently detailed. Because the identification of the ten practices and the results-actionability gap rests entirely on this thematic analysis, the lack of these details prevents full assessment of the analysis's rigor and reproducibility.
Authors: We agree that the original Methods section lacked sufficient detail for full assessment of rigor. In the revised manuscript we will expand this section to include: the full semi-structured interview protocol with example prompts; explicit participant selection criteria (roles, years of experience, sectors, and LLM product involvement); the thematic analysis process following Braun and Clarke, including how initial codes were generated, how themes were iteratively derived and refined across the dataset, and the specific steps taken to ensure consistency (multiple authors independently coded a subset of transcripts, followed by joint discussion to resolve discrepancies). revision: yes
-
Referee: [Findings/Discussion] Findings and Discussion sections: the assertion that ad-hoc practices such as vibe checks are 'necessary adaptations to LLM characteristics' rather than 'methodological failures' is presented without distinguishing evidence. No comparative cases (e.g., teams using more systematic methods), external benchmarks, or falsification attempts are provided; the distinction is drawn solely from interpretive patterns in the 19-interview sample, weakening the novelty and framing of the results-actionability gap.
Authors: We acknowledge that the original framing presented the interpretation without sufficient supporting excerpts or contrast. In revision we will add direct participant quotes illustrating how teams that attempted more structured quantitative methods still encountered non-determinism barriers that rendered results non-actionable, thereby grounding the 'necessary adaptation' claim in the data. We will also revise the language to present this as an interpretation emerging from the observed patterns rather than a definitive causal claim, while preserving the novelty of the results-actionability gap as the core contribution. revision: partial
-
Referee: [Findings] §4 (or equivalent findings section): the generalization that the observed patterns and strategies apply beyond the sample to 'LLM product teams' broadly is not supported by discussion of sample limitations, saturation criteria, or transferability considerations, which is load-bearing for the claim that the gap is a widespread phenomenon requiring specific bridging strategies.
Authors: We agree that the manuscript should more explicitly address generalizability. The revised version will include a new Limitations subsection that reports sample characteristics (19 practitioners across 12 organizations and 5 sectors), describes how thematic saturation was assessed (no new themes emerged after the 14th interview, with the remaining five confirming existing themes), and discusses transferability considerations, including the diversity of the sample while cautioning that findings represent patterns observed in this group rather than universal claims. The bridging strategies will be framed as empirically derived suggestions rather than prescriptive for all teams. revision: yes
Circularity Check
No circularity: purely empirical qualitative study with independent observations
full rationale
The paper reports thematic analysis from 19 practitioner interviews, identifying practices and introducing the 'results-actionability gap' as an observed pattern. No equations, derivations, fitted parameters, or self-citations appear that reduce any claim to prior inputs by construction. The interpretive suggestion that practices are 'necessary adaptations' is framed as analysis of the sample data rather than a formal reduction or uniqueness theorem. The work is self-contained against external benchmarks as an exploratory qualitative contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Qualitative data from 19 practitioner interviews accurately captures real-world evaluation practices and challenges.
invented entities (1)
- results-actionability gap: no independent evidence
Reference graph
Works this paper leans on
- [1] Mary J Allen and Wendy M Yen. 2001. Introduction to measurement theory. Waveland Press, Long Grove, IL.
- [2] Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software Engineering for Machine Learning: A Case Study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 291–300.
- [4] Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine 36, 1 (2015), 15–24.
- [5] Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. LLMs ins...
- [6] Anja Belz and Ehud Reiter. 2006. Comparing Automatic and Human Evaluation of NLG Systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Trento, Italy, 313–320.
- [7] Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. 2024. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782 (2024).
- [8] Virginia Braun and Victoria Clarke. 2022. Thematic Analysis: A Practical Guide. Sage, London, UK.
- [9] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 15, 3, Article 39 (March 2024), 45 pages. https://doi.org/10.1145/3641289
- [10]
- [11] Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, and Mateja Jamnik. 2024. Evaluating language models for mathematics through interactions. Proceedings of the National Academy of Sciences 121, 24 (June 2024), ...
- [12] Lucas Colusso, Cynthia L. Bennett, Gary Hsieh, and Sean A. Munson. 2017. Translational Resources: Reducing the Gap Between Academic Research and HCI Practice. In Proceedings of the 2017 Conference on Designing Interactive Systems (DIS ’17). ACM, New York, NY, USA, 957–968.
- [13] Ward Cunningham. 1992. The WyCash Portfolio Management System. In Addendum to the Proceedings on Object-oriented Programming Systems, Languages, and Applications (OOPSLA ’92). ACM, New York, NY, USA, 29–30. https://doi.org/10.1145/157709.157715
- [14] Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10 (2022), 92–110.
- [15] Deloitte AI Institute. 2021. Women in AI: Infographic. https://www2.deloitte.com/content/dam/Deloitte/us/Documents/deloitte-analytics/us-consulting-ai-institute-women-in-ai-infographic.pdf. [Accessed: 2025-09-01]
- [16] Finale Doshi-Velez and Been Kim. 2017. Towards a Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608 (2017).
- [17] Aparna Elangovan, Ling Liu, Lei Xu, Sravan Babu Bodapati, and Dan Roth. 2024. ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). As...
- [18] Edona Elshan, Christian Engel, Philipp Ebel, and Dominik Siemon. 2022. Assessing the Reusability of Design Principles in the Realm of Conversational Agents. In The Transdisciplinary Reach of Design Science Research: 17th International Conference on Design Science Research in Information Systems and Technology, DESRIST 2022, St Petersburg, FL, USA, Jun...
- [19] Steven Fokkinga, Pieter Desmet, and Paul Hekkert. 2020. Impact-centered design: Introducing an integrated framework of the psychological and behavioral effects of design. International Journal of Design 14, 3 (2020), 97.
- [20] Shannon K. Gallagher, Jasmine Ratchford, Tyler Brooks, Bryan P. Brown, Eric Heim, William R. Nichols, Scott Mcmillan, Swati Rallapalli, Carol J. Smith, Nathan Vanhoudnos, Nick Winski, and Andrew O. Mellinger. 2024. Assessing LLMs for High Stakes Applications. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering ...
- [21] Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2023. Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text. Journal of Artificial Intelligence Research 77 (May 2023), 103–166. https://doi.org/10.1613/jair.1.13715
- [23] Steve Harrison, Deborah Tatar, and Phoebe Sengers. 2007. The three paradigms of HCI (alt.CHI ’07). Association for Computing Machinery, New York, NY, USA, 1–18.
- [24] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. https://doi.org/10.48550/arXiv.2009.03300 arXiv:2009.03300 [cs]
- [25] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé, Miro Dudik, and Hanna Wallach. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–16. https:...
- [26] Lujain Ibrahim, Saffron Huang, Lama Ahmad, and Markus Anderljung. 2024. Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks. https://doi.org/10.48550/arXiv.2405.10632 arXiv:2405.10632 [cs]
- [27] Abigail Z. Jacobs and Hanna Wallach. 2021. Measurement and Fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 375–385. https://doi.org/10.1145/3442188.3445901
- [28] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit...
- [29] Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, and Percy Liang. 2023. Evaluating Human-Language Model Interaction. arXiv:2212.09746 [cs]. CHI ’26, A...
- [30] Alexia Leibbrandt. 2021. Women and the Digital Revolution. In UNESCO Science Report: The Race Against Time for Smarter Development. UNESCO, Chapter 3. [Accessed: 2025-09-01]
- [31] Holistic Evaluation of Language Models. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hong...
- [32] Q. Vera Liao and Ziang Xiao. 2023. Rethinking Model Evaluation as Narrowing the Socio-Technical Gap. arXiv:2306.03100 [cs]
- [33] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2024. AgentBench: Evaluating LLMs as Agents. In Proceedings of the Twelfth International Conference on Learning Representations.
- [34] Yixin Liu, Alexander Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, and Arman Cohan. 2024. Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Hel...
- [35] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2511–2522. http...
- [36]
- [37] Yu Lu Liu, Su Lin Blodgett, Jackie Chi Kit Cheung, Q. Vera Liao, Alexandra Olteanu, and Ziang Xiao. 2024. ECBD: Evidence-Centered Benchmark Design for NLP.
- [38] Wanqin Ma, Chenyang Yang, and Christian Kästner. 2024. (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (Lisbon, Portugal) (CAIN ’24). Association for Computing Machinery, New York, NY, USA, 166–171. https://do...
- [39] Alina Mailach, Sebastian Simon, Johannes Dorn, and Norbert Siegmund. 2025. Themes of Building LLM-Based Applications for Production: A Practitioner’s View. In 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE Computer Society, Los Alamitos, CA, USA, 18–30. https://doi.org/10.1109/CAIN66642.2025.00011
- [40] Raiza Martin and Usama Bin Shafqat. 2024. How NotebookLM Was Made. Latent Space podcast. https://www.latent.space/p/notebooklm
- [41] Timothy R McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N Halgamuge. 2025. Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence. IEEE Transactions on Artificial Intelligence (2025), 1–18. https://doi.org/10.1109/TAI.2025.3569516
- [42] Samuel Messick. 1995. Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist 50, 9 (1995), 741–749.
- [43] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika ...
- [44] Nadia Nahar, Christian Kastner, Jenna Butler, Chris Parnin, Thomas Zimmermann, and Christian Bird. 2025. Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE Computer Soci...
- [45] David Nigenda, Zohar Karnin, Muhammad Bilal Zafar, Raghu Ramesha, Alan Tan, Michele Donini, and Krishnaram Kenthapadi. 2022. Amazon SageMaker Model Monitor: A System for Real-Time Insights into Deployed Machine Learning Models.
- [46] Donald A Norman and Pieter Jan Stappers. 2016. DesignX: complex sociotechnical systems. She Ji: The Journal of Design, Economics, and Innovation 1, 2 (2016), 83–106.
- [47]
- [48] Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, and Austin Z. Henley. 2025. Building Your Own Product Copilot: Challenges, Opportunities, and Needs. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 338–348. https://doi.org/10.1109/SANER64311.2025.00039
- [49] Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcel...
- [50] David J. Roedl and Erik Stolterman. 2013. Design research at CHI and its applicability to design practice. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Paris, France) (CHI ’13). Association for Computing Machinery, New York, NY, USA, 1951–1954. https://doi.org/10.1145/2470654.2466257
- [51] Malak Sadek, Rafael A Calvo, and Celine Mougenot. 2023. Trends, Challenges and Processes in Conversational Agent Design: Exploring Practitioners’ Views through Semi-Structured Interviews. In Proceedings of the 5th International Conference on Conversational User Interfaces (Eindhoven, Netherlands) (CUI ’23). Association for Computing Machinery, New York, NY...
- [52]
- [53]
- [54] Murtuza N. Shergadwala, Himabindu Lakkaraju, and Krishnaram Kenthapadi.
- [55] A Human-Centric Take on Model Monitoring.
- [56] Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan. 2023. Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence-to-Sequence Tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Singapore, 8776–8788. https://doi.org/10.186...
- [57] Aarohi Srivastava et al. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. https://doi.org/10.48550/arXiv.2206.04615 arXiv:2206.04615 [cs, stat]
- [58] Tammy Y. C. Tam, Sumathy Sivarajkumar, Shauna Kapoor, et al. 2024. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine 7, 1 (2024), 258. https://doi.org/10.1038/s41746-024-01258-7
- [59] Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, E...
- [60] Ashok Urlana, Charaka Vinayak Kumar, Bala Mallikarjunarao Garlapati, Ajeet Kumar Singh, and Rahul Mishra. 2025. No Size Fits All: The Perils and Pitfalls of Leveraging LLMs Vary with Company Size. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa...
- [61] Willem van der Maden, Derek Lomas, and Paul Hekkert. 2024. Developing and evaluating a design method for positive artificial intelligence. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 38 (2024), e14. https://doi.org/10.1017/S0890060424000155
- [62] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA.
- [63] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Tal Linzen, Grzegorz Chrupała, and Afra Alishahi (Eds.). Association for...
- [64] Chenyu Wang, Zhou Yang, Zewei Li, Daniela E. Damian, and David Lo. 2024. Quality Assurance for Artificial Intelligence: A Study of Industrial Concerns, Challenges and Best Practices. ArXiv abs/2402.16391 (2024). https://doi.org/10.48550/arXiv.2402.16391
- [65] Jiyao Wang, Haolong Hu, Zuyuan Wang, Song Yan, Youyu Sheng, and Dengbo He.
- [66] Evaluating Large Language Models on Academic Literature Understanding and Review: An Empirical Study among Early-stage Scholars. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 12, 18 pages. https://doi.org/10.1145/3613904.3641917
- [67]
- [68] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In Advances in Neural Information Processing Systems 37 (NeurIPS
- [69] Datasets and Benchmarks Track.
- [70] Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, and William Isaac. 2023. Sociotechnical Safety Evaluation of Generative AI Systems. arXiv:2310.11986 [cs]
- [71] World Economic Forum and LinkedIn. 2025. Gender Parity in the Intelligent Age. [Accessed: 2025-09-01]
- [72] Ziang Xiao, Susu Zhang, Vivian Lai, and Q. Vera Liao. 2023. Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Sing...
- [73] Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2024. Benchmarking Large Language Models for News Summarization. Transactions of the Association for Computational Linguistics 12 (2024), 39–57. https://doi.org/10.1162/tacl_a_00632
- [74] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623.
- [75] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. https://doi.org/10.48550/arXiv.2306.05685 arXiv:2306.05685 [cs]
- [76] Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé III, Kaheer Suleman, and Alexandra Olteanu. 2022. Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Marine Carpu...
- [77] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al.
- [78] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. In Proceedings of the Thirteenth International Conference on Learning Representations.