Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners

2025 · cs.HC · arXiv 2511.05501

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Benchmarks play a significant role in how technology companies communicate about model capabilities and how researchers and the public understand generative AI systems. However, existing benchmarks have been criticized for failing to adequately capture real-world usage (i.e., ecological validity) or to measure underlying concepts (i.e., construct validity). Building on approaches in HCI, we adopt a human-centered design process to address such critiques. Working within the journalism domain, we engaged 23 professionals in a workshop, which informed the design of a domain-oriented evaluation "cookbook". Our workshop findings surface domain-specific challenges and tensions faced by designers in translating specific tasks to evaluation constructs, aligning metrics with domain-specific values, and balancing needs among different stakeholders when constructing evaluations. Through an instantiation of design-based approaches for benchmark creation in the journalism domain, this work not only produces an evaluation structure for journalism practitioners to experiment with, but also lays out design requirements for AI evaluations that are contextualized and value-aligned and that cultivate evaluative literacy among domain end-users.

representative citing papers

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

cs.CY · 2026-05-21 · conditional · novelty 6.0

Healthcare LLM benchmarks overlook implicit assumptions about user behavior that split into task assumptions testable from conversation data and outcome assumptions requiring behavioral studies, shown by reanalyzing an RCT where both gaps are roughly equal.

citing papers explorer

Showing 1 of 1 citing paper.

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions cs.CY · 2026-05-21 · conditional · none · ref 7 · internal anchor
