Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.
citing papers explorer
-
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
-
Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards
Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.