{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:EFORJFNADYDAH7LN4WM5CM4KXJ","short_pith_number":"pith:EFORJFNA","schema_version":"1.0","canonical_sha256":"215d1495a01e0603fd6de599d1338aba51bb268ef28fcafdd762511341b632d9","source":{"kind":"arxiv","id":"2605.04135","version":2},"attestation_state":"computed","paper":{"title":"Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Academic LLM evaluations test models 10.85 ECI behind the frontier on average, with the lag widening and frequent overgeneralization to claims about AI.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CY","authors_text":"David Gringras, Misha Salahshoor","submitted_at":"2026-05-05T17:58:35Z","abstract_excerpt":"Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about \"AI\" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2605.04135","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CY","submitted_at":"2026-05-05T17:58:35Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"8cbb7c7abfe2f71c714433ec4e0464b603e5ecd19e4d1c6cafd4f19d670db941","abstract_canon_sha256":"4fb1f0e94802056817ba98e883a914168127740ac53bb5c0f1c6b600d686e278"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-06-05T00:13:46.746931Z","signature_b64":"+UJdovFwmpabNajKWTcvT2YPlV8ygs+sO/Domg8oc1Kbz91Cct01kv3XR2La7mR4479cSzFE7Yd3DP9giPyMDA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"215d1495a01e0603fd6de599d1338aba51bb268ef28fcafdd762511341b632d9","last_reissued_at":"2026-06-05T00:13:46.746257Z","signature_status":"signed_v1","first_computed_at":"2026-06-05T00:13:46.746257Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Academic LLM evaluations test models 10.85 ECI behind the frontier on average, with the lag widening and frequent overgeneralization to claims about AI.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CY","authors_text":"David Gringras, Misha Salahshoor","submitted_at":"2026-05-05T17:58:35Z","abstract_excerpt":"Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about \"AI\" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The median paper evaluates a model +10.85 ECI behind the contemporaneous frontier at evaluation time (H1); the gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Only 3.2% of abstracts disclose reasoning-mode status and 52.5% state conclusions at the level of 'AI'.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the keyword-based sampling of 112,303 records and the 18,574 admissible papers form a representative sample of the LLM evaluation literature, and that the reproduced Epoch AI Capabilities Index accurately ranks models at the time of each paper's evaluation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Academic LLM papers lag the frontier by a median 10.85 ECI points at publication time, with the gap widening 5.53 ECI per year, low disclosure of reasoning modes, and frequent overgeneralization to 'AI'.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Academic LLM evaluations test models 10.85 ECI behind the frontier on average, with the lag widening and frequent overgeneralization to claims about AI.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"315adfcdaf2ba9686660836aa3647c53a7ba68a4754a5566ac50eec108e26fd0"},"source":{"id":"2605.04135","kind":"arxiv","version":2},"verdict":{"id":"7f4e6d4f-6856-43fa-978b-59b9e9aba46a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-08T18:44:38.989463Z","strongest_claim":"The median paper evaluates a model +10.85 ECI behind the contemporaneous frontier at evaluation time (H1); the gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Only 3.2% of abstracts disclose reasoning-mode status and 52.5% state conclusions at the level of 'AI'.","one_line_summary":"Academic LLM papers lag the frontier by a median 10.85 ECI points at publication time, with the gap widening 5.53 ECI per year, low disclosure of reasoning modes, and frequent overgeneralization to 'AI'.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the keyword-based sampling of 112,303 records and the 18,574 admissible papers form a representative sample of the LLM evaluation literature, and that the reproduced Epoch AI Capabilities Index accurately ranks models at the time of each paper's evaluation.","pith_extraction_headline":"Academic LLM evaluations test models 10.85 ECI behind the frontier on average, with the lag widening and frequent overgeneralization to claims about AI."},"integrity":{"clean":false,"summary":{"advisory":1,"critical":0,"by_detector":{"doi_compliance":{"total":1,"advisory":1,"critical":0,"informational":0}},"informational":0},"endpoint":"/pith/2605.04135/integrity.json","findings":[{"note":"DOI in the printed bibliography is fragmented by whitespace or line breaks. A longer candidate (10.3348/kjr.2024.1161.Closest) was visible in the surrounding text but could not be confirmed against doi.org as printed.","detector":"doi_compliance","severity":"advisory","ref_index":3,"audited_at":"2026-05-19T14:49:23.071985Z","detected_doi":"10.3348/kjr.2024.1161.Closest","finding_type":"recoverable_identifier","verdict_class":"incontrovertible","detected_arxiv_id":null}],"available":true,"detectors_run":[{"name":"ai_meta_artifact","ran_at":"2026-05-20T12:38:11.657641Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-20T00:01:21.003978Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T14:49:23.071985Z","status":"completed","version":"1.0.0","findings_count":1}],"snapshot_sha256":"cfad5a251c958bdd00b2ba15fff7ea4cea450469900d166993f21169745eab69"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"51cd33fef36a68757f90c6e67330952cb6f4f597d4c7659a547d1b9a765a8b17"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.04135","created_at":"2026-06-05T00:13:46.746365+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.04135v2","created_at":"2026-06-05T00:13:46.746365+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.04135","created_at":"2026-06-05T00:13:46.746365+00:00"},{"alias_kind":"pith_short_12","alias_value":"EFORJFNADYDA","created_at":"2026-06-05T00:13:46.746365+00:00"},{"alias_kind":"pith_short_16","alias_value":"EFORJFNADYDAH7LN","created_at":"2026-06-05T00:13:46.746365+00:00"},{"alias_kind":"pith_short_8","alias_value":"EFORJFNA","created_at":"2026-06-05T00:13:46.746365+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":1,"internal_anchor_count":1,"sample":[{"citing_arxiv_id":"2606.08529","citing_title":"Scaffold Effects on GAIA: A Controlled Comparison","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ","json":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ.json","graph_json":"https://pith.science/api/pith-number/EFORJFNADYDAH7LN4WM5CM4KXJ/graph.json","events_json":"https://pith.science/api/pith-number/EFORJFNADYDAH7LN4WM5CM4KXJ/events.json","paper":"https://pith.science/paper/EFORJFNA"},"agent_actions":{"view_html":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ","download_json":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ.json","view_paper":"https://pith.science/paper/EFORJFNA","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.04135&json=true","fetch_graph":"https://pith.science/api/pith-number/EFORJFNADYDAH7LN4WM5CM4KXJ/graph.json","fetch_events":"https://pith.science/api/pith-number/EFORJFNADYDAH7LN4WM5CM4KXJ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ/action/storage_attestation","attest_author":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ/action/author_attestation","sign_citation":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ/action/citation_signature","submit_replication":"https://pith.science/pith/EFORJFNADYDAH7LN4WM5CM4KXJ/action/replication_record"}},"created_at":"2026-06-05T00:13:46.746365+00:00","updated_at":"2026-06-05T00:13:46.746365+00:00"}