Position: Evaluating generative AI systems is a social science measurement challenge. arXiv preprint arXiv:2502.00561

Hanna Wallach, Meera Desai, A Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P Alex Dow, et al · 2025 · arXiv 2502.00561

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · other 1

citation-polarity summary

background 1 · unclear 1

representative citing papers

Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios

cs.HC · 2026-05-08 · unverdicted · novelty 6.0

A repeatable worksheet and human-reviewed expansion process turns expert-elicited AI use cases into 107 grounded scenarios to support consistent human-centered evaluations.

"I Just Don't Want My Work Being Fed Into The AI Blender": Queer Artists on Refusing and Resisting Generative AI

cs.HC · 2026-04-15 · unverdicted · novelty 6.0

Queer artists largely refuse and resist generative AI, seeing it as anti-relational and disruptive to the community-oriented, identity-forming nature of their art practices, with only limited acceptance for surreal image generation.

From Ground Truth to Measurement: A Statistical Framework for Human Labeling

stat.ME · 2026-04-08 · unverdicted · novelty 6.0

A statistical framework decomposes human annotation outcomes into four interpretable variation sources and extends classical measurement-error models to handle both shared and individualized notions of truth.

Responsible Evaluation of AI for Mental Health

cs.CY · 2026-01-20 · unverdicted · novelty 6.0

Proposes an interdisciplinary framework and taxonomy for responsible evaluation of AI mental health tools based on analysis of 135 publications identifying gaps in metrics, expert involvement, safety, and equity.

Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents

cs.HC · 2025-09-18 · unverdicted · novelty 5.0

Industry markets AI agents for orchestration, creation, and insight, but a usability study with 31 participants reveals users face challenges from capability misalignment and lack of meta-cognition in tools like Operator and Manus.

Making AI Evaluation Deployment Relevant Through Context Specification

cs.AI · 2026-03-06 · unverdicted · novelty 4.0

Context specification is a process that turns diffuse stakeholder perspectives into explicit definitions of properties, behaviors, and outcomes to guide context-aware AI evaluations.

RLHF May Not Reflect Genuine Preferences

cs.HC · 2026-01-31

citing papers explorer

Showing 7 of 7 citing papers.

Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios · cs.HC · 2026-05-08 · unverdicted · none · ref 58
"I Just Don't Want My Work Being Fed Into The AI Blender": Queer Artists on Refusing and Resisting Generative AI · cs.HC · 2026-04-15 · unverdicted · none · ref 138
From Ground Truth to Measurement: A Statistical Framework for Human Labeling · stat.ME · 2026-04-08 · unverdicted · none · ref 25
Responsible Evaluation of AI for Mental Health · cs.CY · 2026-01-20 · unverdicted · none · ref 18
Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents · cs.HC · 2025-09-18 · unverdicted · none · ref 78
Making AI Evaluation Deployment Relevant Through Context Specification · cs.AI · 2026-03-06 · unverdicted · none · ref 4
RLHF May Not Reflect Genuine Preferences · cs.HC · 2026-01-31 · unreviewed · ref 11
