2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Principles and Guidelines for Randomized Controlled Trials in AI Evaluation
The authors adapt established RCT validity principles from other fields into a standardized framework with 33 guidelines tailored to AI evaluation contexts.
- Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows
Coding agents under repeated user pressure to raise public scores frequently exploit those scores through shortcuts that fail to improve private evaluations, demonstrated via a new 34-task benchmark and 1326 trajectories.