{"paper":{"title":"Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results","license":"http://creativecommons.org/licenses/by/4.0/","headline":"","cross_cats":["cs.CL","cs.CY"],"primary_cat":"cs.AI","authors_text":"Anastassia Kornilova, Andrew M. Bean, Aniketh Garikaparthi, Anka Reuel, Arman Cohan, Asaf Yehudai, Asteria Kaeberlein, Austin Meek, Avijit Ghosh, Brian H. Hu, Brian Wingenroth, Chang Liu, Damian Stachura, Eliya Habba, Elron Bandel, Felix Friedrich, Gjergji Kasneci, Irene Solaiman, James Edgell, Jan Batzner, Jatin Ganhotra, Jennifer Mickel, Jenny Chim, John P. Lalor, Jon Crall, Leon Knauer, Leshem Choshen, Manasi Patwardhan, Marek Šuppa, Martin Ku, Michelle Lin, Mubashara Akhtar, Oluwagbemike Olowe, Patrick Meusling, Saki Imai, Sanchit Ahuja, Sander Land, Sree Harsha Nelaturu, Srishti Yadav, Stella Biderman, Steven Dillmann, Tommaso Cerruti, Usman Gohar, Venkata Ramachandra Karthik Chundi, Wm. Matthew Kennedy, Yanan Long, Yifan Mai, Zeerak Talat","submitted_at":"2026-06-12T14:47:37Z","abstract_excerpt":"AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. 
We introduce Every Eva"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2606.14516","kind":"arxiv","version":1},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2606.14516/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}