arxiv: 2604.03244 · 2 revisions
AI Evaluation Should Require Standardized Item-Level Data Releases