Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Pith reviewed 2026-05-16 11:23 UTC · model grok-4.3
The pith
Practitioners evaluating LLM products gather data they cannot turn into concrete improvements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices and introduce the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. We contribute strategies to bridge this gap, drawn from patterns in successful teams, and argue that interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures.
What carries the argument
The results-actionability gap: the observed disconnect in which evaluation data on LLM products is collected but cannot be converted into specific, actionable product improvements.
If this is right
- LLM product teams rely on a spectrum of practices from informal checks to formal meta-work.
- Successful teams develop organizational processes to link evaluation results to design changes.
- Interpretive practices arise as adaptations to the unpredictable behavior of LLMs.
- HCI research should support systematizing current practices instead of replacing them with new frameworks.
Where Pith is reading between the lines
- Product teams could benefit from lightweight tools that map evaluation signals directly to candidate fixes (a minimal sketch follows this list).
- The gap may lengthen iteration cycles for LLM features in production environments.
- Similar actionability problems could appear when evaluating other non-deterministic AI systems.
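To make the first point above concrete, here is a minimal sketch in Python of what such a signal-to-fix mapping could look like. It is not from the paper: the failure tags, the TAG_TO_FIXES table, and the suggest_fixes helper are hypothetical placeholders for whatever taxonomy and remediation playbook a team actually maintains.

from collections import Counter

# Hypothetical mapping from evaluation failure tags to candidate fixes.
# A real team would derive both the tags and the fixes from its own rubric;
# these values are placeholders for illustration only.
TAG_TO_FIXES = {
    "hallucinated_fact": ["add retrieval grounding", "tighten system prompt"],
    "ignored_instruction": ["add few-shot examples", "split task into steps"],
    "tone_mismatch": ["revise style guidance in prompt"],
    "format_error": ["add output schema validation", "post-process responses"],
}

def suggest_fixes(tagged_failures, top_n=3):
    """Rank candidate fixes by how many tagged failures they would address."""
    counts = Counter()
    for tags in tagged_failures:
        for tag in tags:
            for fix in TAG_TO_FIXES.get(tag, []):
                counts[fix] += 1
    return counts.most_common(top_n)

# Each inner list holds the tags reviewers attached to one failing output.
failures = [
    ["hallucinated_fact", "format_error"],
    ["hallucinated_fact"],
    ["ignored_instruction", "format_error"],
]
print(suggest_fixes(failures))
# e.g. [('add retrieval grounding', 2), ('tighten system prompt', 2), ('add output schema validation', 2)]

Ranking fixes by how many observed failures they touch is one simple way to make evaluation output point at a next action; a real tool would also need to track whether a shipped fix actually moved the corresponding failure rate.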
Load-bearing premise
That the patterns seen in interviews with nineteen practitioners generalize to most LLM product teams and reflect necessary adaptations to LLM traits rather than avoidable team shortcomings.
What would settle it
A larger study that finds most LLM product teams routinely convert evaluation data into specific improvements without special bridging strategies would falsify the gap as a widespread phenomenon.
original abstract
How do product teams evaluate LLM-powered products? As organizations integrate large language models (LLMs) into digital products, their unpredictable nature makes traditional evaluation approaches inadequate, yet little is known about how practitioners navigate this challenge. Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices spanning informal 'vibe checks' to organizational meta-work. Beyond confirming four documented challenges, we introduce a novel fifth we call the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. Drawing on patterns from successful teams, we contribute strategies to bridge this gap, supporting practitioners' formalization journey from ad-hoc interpretive practices (e.g., vibe checks) toward systematic evaluation. Our analysis suggests these interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures. For HCI researchers, this presents a research opportunity to support practitioners in systematizing emerging practices rather than developing new evaluation frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents findings from interviews with 19 practitioners across sectors on how they evaluate LLM-powered products. It identifies ten evaluation practices (from informal 'vibe checks' to organizational meta-work), confirms four previously documented challenges, and introduces a novel 'results-actionability gap' in which teams collect evaluation data but cannot translate it into concrete product improvements. Drawing on patterns from successful teams, it proposes strategies to bridge the gap and frames the observed ad-hoc interpretive practices as necessary adaptations to LLM characteristics rather than methodological shortcomings, while highlighting opportunities for HCI research to support systematization.
Significance. If the findings hold, the work offers timely empirical grounding for the practical realities of LLM product evaluation in industry, documenting a previously unarticulated results-actionability gap and surfacing actionable strategies. This contributes to HCI and SE by shifting focus from new frameworks to supporting the formalization of existing practices, with potential to inform tool design and organizational processes for handling non-deterministic systems.
major comments (3)
- [Methods] Methods section: the description of the interview protocol, participant selection criteria, coding process (including how themes were derived and iterated), and any inter-rater reliability measures is absent or insufficiently detailed. Because the identification of the ten practices and the results-actionability gap rests entirely on this thematic analysis, the lack of these details prevents full assessment of the analysis's rigor and reproducibility.
- [Findings/Discussion] Findings and Discussion sections: the assertion that ad-hoc practices such as vibe checks are 'necessary adaptations to LLM characteristics' rather than 'methodological failures' is presented without distinguishing evidence. No comparative cases (e.g., teams using more systematic methods), external benchmarks, or falsification attempts are provided; the distinction is drawn solely from interpretive patterns in the 19-interview sample, weakening the novelty and framing of the results-actionability gap.
- [Findings] §4 (or equivalent findings section): the generalization that the observed patterns and strategies apply beyond the sample to 'LLM product teams' broadly is not supported by discussion of sample limitations, saturation criteria, or transferability considerations, which is load-bearing for the claim that the gap is a widespread phenomenon requiring specific bridging strategies.
minor comments (2)
- [Abstract] Abstract: the phrase 'our analysis suggests these interpretive practices are necessary adaptations' could be qualified to reflect that this is an interpretive claim from the sample rather than a tested causal finding.
- The manuscript would benefit from a table summarizing the ten practices with brief definitions and example quotes to improve readability and allow readers to map them directly to the results-actionability gap.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the paper's empirical contributions to understanding LLM product evaluation practices. We have addressed each major comment by expanding methodological transparency, strengthening the evidential basis for interpretive claims, and adding explicit discussion of sample limitations and transferability. All revisions will be incorporated in the next manuscript version.
point-by-point responses
-
Referee: [Methods] Methods section: the description of the interview protocol, participant selection criteria, coding process (including how themes were derived and iterated), and any inter-rater reliability measures is absent or insufficiently detailed. Because the identification of the ten practices and the results-actionability gap rests entirely on this thematic analysis, the lack of these details prevents full assessment of the analysis's rigor and reproducibility.
Authors: We agree that the original Methods section lacked sufficient detail for full assessment of rigor. In the revised manuscript we will expand this section to include: the full semi-structured interview protocol with example prompts; explicit participant selection criteria (roles, years of experience, sectors, and LLM product involvement); the thematic analysis process following Braun and Clarke, including how initial codes were generated, how themes were iteratively derived and refined across the dataset, and the specific steps taken to ensure consistency (multiple authors independently coded a subset of transcripts, followed by joint discussion to resolve discrepancies). revision: yes
-
Referee: [Findings/Discussion] Findings and Discussion sections: the assertion that ad-hoc practices such as vibe checks are 'necessary adaptations to LLM characteristics' rather than 'methodological failures' is presented without distinguishing evidence. No comparative cases (e.g., teams using more systematic methods), external benchmarks, or falsification attempts are provided; the distinction is drawn solely from interpretive patterns in the 19-interview sample, weakening the novelty and framing of the results-actionability gap.
Authors: We acknowledge that the original framing presented the interpretation without sufficient supporting excerpts or contrast. In revision we will add direct participant quotes illustrating how teams that attempted more structured quantitative methods still encountered non-determinism barriers that rendered results non-actionable, thereby grounding the 'necessary adaptation' claim in the data. We will also revise the language to present this as an interpretation emerging from the observed patterns rather than a definitive causal claim, while preserving the novelty of the results-actionability gap as the core contribution. revision: partial
-
Referee: [Findings] §4 (or equivalent findings section): the generalization that the observed patterns and strategies apply beyond the sample to 'LLM product teams' broadly is not supported by discussion of sample limitations, saturation criteria, or transferability considerations, which is load-bearing for the claim that the gap is a widespread phenomenon requiring specific bridging strategies.
Authors: We agree that the manuscript should more explicitly address generalizability. The revised version will include a new Limitations subsection that reports sample characteristics (19 practitioners across 12 organizations and 5 sectors), describes how thematic saturation was assessed (no new themes emerged after the 14th interview, with the remaining five confirming existing themes), and discusses transferability considerations, including the diversity of the sample while cautioning that findings represent patterns observed in this group rather than universal claims. The bridging strategies will be framed as empirically derived suggestions rather than prescriptive for all teams. revision: yes
Circularity Check
No circularity: purely empirical qualitative study with independent observations
full rationale
The paper reports thematic analysis from 19 practitioner interviews, identifying practices and introducing the 'results-actionability gap' as an observed pattern. No equations, derivations, fitted parameters, or self-citations appear that reduce any claim to prior inputs by construction. The interpretive suggestion that practices are 'necessary adaptations' is framed as analysis of the sample data rather than a formal reduction or uniqueness theorem. The work is self-contained against external benchmarks as an exploratory qualitative contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Qualitative data from 19 practitioner interviews accurately captures real-world evaluation practices and challenges.
invented entities (1)
- results-actionability gap: no independent evidence
Reference graph
Works this paper leans on
- [1] Mary J Allen and Wendy M Yen. 2001. Introduction to measurement theory. Waveland Press, Long Grove, IL.
- [2] Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software Engineering for Machine Learning: A Case Study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 291–300.
- [4] Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine 36, 1 (2015), 15–24.
- [5] Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, Andre Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, and Alberto Testoni. 2025. LLMs ins...
- [6] Anja Belz and Ehud Reiter. 2006. Comparing Automatic and Human Evaluation of NLG Systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Trento, Italy, 313–320.
- [7] Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. 2024. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782 (2024).
- [8] Virginia Braun and Victoria Clarke. 2022. Thematic Analysis: A Practical Guide. Sage, London, UK.
- [9] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2024. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 15, 3, Article 39 (March 2024), 45 pages. https://doi.org/10.1145/3641289
- [10]
- [11] Katherine M. Collins, Albert Q. Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B. Tenenbaum, William Hart, Timothy Gowers, Wenda Li, Adrian Weller, and Mateja Jamnik. 2024. Evaluating language models for mathematics through interactions. Proceedings of the National Academy of Sciences 121, 24 (June 2024), ...
- [12] Lucas Colusso, Cynthia L. Bennett, Gary Hsieh, and Sean A. Munson. 2017. Translational Resources: Reducing the Gap Between Academic Research and HCI Practice. In Proceedings of the 2017 Conference on Designing Interactive Systems (DIS ’17). ACM, New York, NY, USA, 957–968.
- [13] Ward Cunningham. 1992. The WyCash Portfolio Management System. In Addendum to the Proceedings on Object-oriented Programming Systems, Languages, and Applications (OOPSLA ’92). ACM, New York, NY, USA, 29–30. https://doi.org/10.1145/157709.157715
- [14] Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10 (2022), 92–110.
- [15] Deloitte AI Institute. 2021. Women in AI: Infographic. https://www2.deloitte.com/content/dam/Deloitte/us/Documents/deloitte-analytics/us-consulting-ai-institute-women-in-ai-infographic.pdf. [Accessed: 2025-09-01]
- [16] Finale Doshi-Velez and Been Kim. 2017. Towards a Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608 (2017).
- [17] Aparna Elangovan, Ling Liu, Lei Xu, Sravan Babu Bodapati, and Dan Roth. 2024. ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). As...
- [18] Edona Elshan, Christian Engel, Philipp Ebel, and Dominik Siemon. 2022. Assessing the Reusability of Design Principles in the Realm of Conversational Agents. In The Transdisciplinary Reach of Design Science Research: 17th International Conference on Design Science Research in Information Systems and Technology, DESRIST 2022, St Petersburg, FL, USA, Jun...
- [19] Steven Fokkinga, Pieter Desmet, and Paul Hekkert. 2020. Impact-centered design: Introducing an integrated framework of the psychological and behavioral effects of design. International Journal of Design 14, 3 (2020), 97.
- [20] Shannon K. Gallagher, Jasmine Ratchford, Tyler Brooks, Bryan P. Brown, Eric Heim, William R. Nichols, Scott Mcmillan, Swati Rallapalli, Carol J. Smith, Nathan Vanhoudnos, Nick Winski, and Andrew O. Mellinger. 2024. Assessing LLMs for High Stakes Applications. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering ...
- [21] Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2023. Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text. Journal of Artificial Intelligence Research 77 (May 2023), 103–166. https://doi.org/10.1613/jair.1.13715
- [23] Steve Harrison, Deborah Tatar, and Phoebe Sengers. 2007. The three paradigms of HCI (alt.CHI ’07). Association for Computing Machinery, New York, NY, USA, 1–18.
- [24] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. https://doi.org/10.48550/arXiv.2009.03300 arXiv:2009.03300 [cs]
- [25] Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé, Miro Dudik, and Hanna Wallach. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–16. https:...
- [26] Lujain Ibrahim, Saffron Huang, Lama Ahmad, and Markus Anderljung. 2024. Beyond static AI evaluations: advancing human interaction evaluations for LLM harms and risks. https://doi.org/10.48550/arXiv.2405.10632 arXiv:2405.10632 [cs]
- [27] Abigail Z. Jacobs and Hanna Wallach. 2021. Measurement and Fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 375–385. https://doi.org/10.1145/3442188.3445901
- [28] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. 2024. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit...
- [29] Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, and Percy Liang. 2023. Evaluating Human-Language Model Interaction. arXiv:2212.09746 [cs]. CHI ’26, A...
- [30] Alexia Leibbrandt. 2021. Women and the Digital Revolution. In UNESCO Science Report: The Race Against Time for Smarter Development. UNESCO, Chapter 3. [Accessed: 2025-09-01]
- [31] Holistic Evaluation of Language Models. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hong...
- [32] Q. Vera Liao and Ziang Xiao. 2023. Rethinking Model Evaluation as Narrowing the Socio-Technical Gap. arXiv:2306.03100 [cs]
- [33] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2024. AgentBench: Evaluating LLMs as Agents. In Proceedings of the Twelfth International Conference on Learning Representations.
- [34] Yixin Liu, Alexander Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, Pengfei Liu, Dragomir Radev, Chien-Sheng Wu, and Arman Cohan. 2024. Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Hel...
- [35] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2511–2522. http...
- [36]
- [37] Yu Lu Liu, Su Lin Blodgett, Jackie Chi Kit Cheung, Q. Vera Liao, Alexandra Olteanu, and Ziang Xiao. 2024. ECBD: Evidence-Centered Benchmark Design for NLP.
- [38] Wanqin Ma, Chenyang Yang, and Christian Kästner. 2024. (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI (Lisbon, Portugal) (CAIN ’24). Association for Computing Machinery, New York, NY, USA, 166–171. https://do...
- [39] Alina Mailach, Sebastian Simon, Johannes Dorn, and Norbert Siegmund. 2025. Themes of Building LLM-Based Applications for Production: A Practitioner’s View. In 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE Computer Society, Los Alamitos, CA, USA, 18–30. https://doi.org/10.1109/CAIN66642.2025.00011
- [40] Raiza Martin and Usama Bin Shafqat. 2024. How NotebookLM Was Made. Latent Space podcast. https://www.latent.space/p/notebooklm
- [41] Timothy R McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N Halgamuge. 2025. Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence. IEEE Transactions on Artificial Intelligence (2025), 1–18. https://doi.org/10.1109/TAI.2025.3569516
- [42] Samuel Messick. 1995. Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist 50, 9 (1995), 741–749.
- [43] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika ...
- [44] Nadia Nahar, Christian Kastner, Jenna Butler, Chris Parnin, Thomas Zimmermann, and Christian Bird. 2025. Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE Computer Soci...
- [45] David Nigenda, Zohar Karnin, Muhammad Bilal Zafar, Raghu Ramesha, Alan Tan, Michele Donini, and Krishnaram Kenthapadi. 2022. Amazon SageMaker Model Monitor: A System for Real-Time Insights into Deployed Machine Learning Models.
- [46] Donald A Norman and Pieter Jan Stappers. 2016. DesignX: complex sociotechnical systems. She Ji: The Journal of Design, Economics, and Innovation 1, 2 (2016), 83–106.
- [47]
- [48] Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, and Austin Z. Henley. 2025. Building Your Own Product Copilot: Challenges, Opportunities, and Needs. In 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 338–348. https://doi.org/10.1109/SANER64311.2025.00039
- [49] Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Barcel...
- [50] David J. Roedl and Erik Stolterman. 2013. Design research at CHI and its applicability to design practice. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Paris, France) (CHI ’13). Association for Computing Machinery, New York, NY, USA, 1951–1954. https://doi.org/10.1145/2470654.2466257
- [51] Malak Sadek, Rafael A Calvo, and Celine Mougenot. 2023. Trends, Challenges and Processes in Conversational Agent Design: Exploring Practitioners’ Views through Semi-Structured Interviews. In Proceedings of the 5th International Conference on Conversational User Interfaces (Eindhoven, Netherlands) (CUI ’23). Association for Computing Machinery, New York, NY...
- [52]
- [53]
- [54] Murtuza N. Shergadwala, Himabindu Lakkaraju, and Krishnaram Kenthapadi.
- [55] A Human-Centric Take on Model Monitoring.
- [56] Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan. 2023. Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence-to-Sequence Tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Singapore, 8776–8788. https://doi.org/10.186...
- [57] Aarohi Srivastava et al. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. https://doi.org/10.48550/arXiv.2206.04615 arXiv:2206.04615 [cs, stat]
- [58] Tammy Y. C. Tam, Sumathy Sivarajkumar, Shauna Kapoor, et al. 2024. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine 7, 1 (2024), 258. https://doi.org/10.1038/s41746-024-01258-7
- [59] Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, E...
- [60] Ashok Urlana, Charaka Vinayak Kumar, Bala Mallikarjunarao Garlapati, Ajeet Kumar Singh, and Rahul Mishra. 2025. No Size Fits All: The Perils and Pitfalls of Leveraging LLMs Vary with Company Size. In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa...
- [61] Willem van der Maden, Derek Lomas, and Paul Hekkert. 2024. Developing and evaluating a design method for positive artificial intelligence. Artificial Intelligence for Engineering Design, Analysis and Manufacturing 38 (2024), e14. https://doi.org/10.1017/S0890060424000155
- [62] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA.
- [63] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Tal Linzen, Grzegorz Chrupała, and Afra Alishahi (Eds.). Association for...
- [64] Chenyu Wang, Zhou Yang, Zewei Li, Daniela E. Damian, and David Lo. 2024. Quality Assurance for Artificial Intelligence: A Study of Industrial Concerns, Challenges and Best Practices. ArXiv abs/2402.16391 (2024). https://doi.org/10.48550/arXiv.2402.16391
- [65] Jiyao Wang, Haolong Hu, Zuyuan Wang, Song Yan, Youyu Sheng, and Dengbo He.
- [66] Evaluating Large Language Models on Academic Literature Understanding and Review: An Empirical Study among Early-stage Scholars. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 12, 18 pages. https://doi.org/10.1145/3613904.3641917
- [67]
- [68] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In Advances in Neural Information Processing Systems 37 (NeurIPS
- [69] Datasets and Benchmarks Track.
- [70] Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, and William Isaac. 2023. Sociotechnical Safety Evaluation of Generative AI Systems. arXiv:2310.11986 [cs]
- [71] World Economic Forum and LinkedIn. 2025. Gender Parity in the Intelligent Age. [Accessed: 2025-09-01]
- [72] Ziang Xiao, Susu Zhang, Vivian Lai, and Q. Vera Liao. 2023. Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Sing...
- [73] Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2024. Benchmarking Large Language Models for News Summarization. Transactions of the Association for Computational Linguistics 12 (2024), 39–57. https://doi.org/10.1162/tacl_a_00632
- [74] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623.
- [75] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. https://doi.org/10.48550/arXiv.2306.05685 arXiv:2306.05685 [cs]
- [76] Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé III, Kaheer Suleman, and Alexandra Olteanu. 2022. Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Marine Carpu...
- [77] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al.
- [78] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions. In Proceedings of the Thirteenth International Conference on Learning Representations.