{"paper":{"title":"Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Aarush Sinha, Amit Saha, Anastassia Kornilova, Andrea Loehr, Andrew Tran, Anka Reuel, Anoop Mishra, Aris Hofmann, Asaf Yehudai, Avijit Ghosh, Damian Stachura, David Manheim, Drishti Sharma, Eliya Habba, Felix Friedrich, Irene Solaiman, Ishan Khire, Jan Batzner, Jeba Sania, Jennifer Mickel, Jenny Chim, Jessica Ji, Kabir Manghnani, Kevin Klyman, Leshem Choshen, Max Lamparth, Michael Alexander Riegler, Michael Hardy, Michelle Lin, Mubashara Akhtar, Mykel Kochenderfer, Nathan Heath, Nuno Moniz, Ruchira Dhar, Sanmi Koyejo, Shalaleh Rismani, Sree Harsha Nelaturu, Srishti Yadav, Stella Biderman, Subramanyam Sahoo, Usman Gohar, Wm. Matthew Kennedy, Yacine Jernite, Yanan Jiang, Yanan Long, Yilin Huang, Yixiong Hao, Zeerak Talat","submitted_at":"2026-06-08T17:55:02Z","abstract_excerpt":"AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. \nRecent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2606.09809","kind":"arxiv","version":1},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2606.09809/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}