Position: Evaluating generative AI systems is a social science measurement challenge. arXiv preprint arXiv:2502.00561
7 Pith papers cite this work.
citing papers explorer
- Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
  A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.
- "I Just Don't Want My Work Being Fed Into The AI Blender": Queer Artists on Refusing and Resisting Generative AI
  Queer artists largely refuse and resist generative AI, seeing it as anti-relational and disruptive to the community-oriented, identity-forming nature of their art practices, with only limited acceptance for surreal image generation.
- From Ground Truth to Measurement: A Statistical Framework for Human Labeling
  A statistical framework decomposes human annotation outcomes into four interpretable variation sources and extends classical measurement-error models to handle both shared and individualized notions of truth.
- Responsible Evaluation of AI for Mental Health
  Proposes an interdisciplinary framework and taxonomy for responsible evaluation of AI mental health tools based on analysis of 135 publications identifying gaps in metrics, expert involvement, safety, and equity.
- Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents
  Industry markets AI agents for orchestration, creation, and insight, but a usability study with 31 participants reveals users face challenges from capability misalignment and lack of meta-cognition in tools like Operator and Manus.
- Making AI Evaluation Deployment Relevant Through Context Specification
  Context specification is a process that turns diffuse stakeholder perspectives into explicit definitions of properties, behaviors, and outcomes to guide context-aware AI evaluations.
- RLHF May Not Reflect Genuine Preferences