{"state_type":"pith_open_graph_state","state_version":"1.0","pith_number":"pith:2026:PTMIVYF57TOIQIPUDUGX2MICTO","merge_version":"pith-open-graph-merge-v1","event_count":2,"valid_event_count":2,"invalid_event_count":0,"equivocation_count":0,"current":{"canonical_record":{"metadata":{"abstract_canon_sha256":"c15f1938ef9dde2ace8989fb774fd69b3b0fbe6d4a64d90a55ef4877104997bd","cross_cats_sorted":["cs.AI"],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-14T21:01:05Z","title_canon_sha256":"0a97817509d89b0754952f0409660f732f9fb2d7b2b5893010e11dfa0c0ee9db"},"schema_version":"1.0","source":{"id":"2605.15416","kind":"arxiv","version":1}},"source_aliases":[{"alias_kind":"arxiv","alias_value":"2605.15416","created_at":"2026-05-20T00:00:57Z"},{"alias_kind":"arxiv_version","alias_value":"2605.15416v1","created_at":"2026-05-20T00:00:57Z"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.15416","created_at":"2026-05-20T00:00:57Z"},{"alias_kind":"pith_short_12","alias_value":"PTMIVYF57TOI","created_at":"2026-05-20T00:00:57Z"},{"alias_kind":"pith_short_16","alias_value":"PTMIVYF57TOIQIPU","created_at":"2026-05-20T00:00:57Z"},{"alias_kind":"pith_short_8","alias_value":"PTMIVYF5","created_at":"2026-05-20T00:00:57Z"}],"graph_snapshots":[{"event_id":"sha256:466cb6f43a171640c597a13772ba448569d42ed9c05fa4bc07e778b6e1039643","target":"graph","created_at":"2026-05-20T00:00:57Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"graph_snapshot":{"author_claims":{"count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","strong_count":0},"builder_version":"pith-number-builder-2026-05-17-v1","claims":{"count":4,"items":[{"attestation":"unclaimed","claim_id":"C1","kind":"strongest_claim","source":"verdict.strongest_claim","status":"machine_extracted","text":"When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models."},{"attestation":"unclaimed","claim_id":"C2","kind":"weakest_assumption","source":"verdict.weakest_assumption","status":"machine_extracted","text":"That training on simulated annotator diversity produces a confidence estimator whose ranking behavior transfers to real human disagreement distributions; the abstract notes the original monotonicity assumption is often violated but does not quantify how well the simulation matches actual human variance."},{"attestation":"unclaimed","claim_id":"C3","kind":"one_line_summary","source":"verdict.one_line_summary","status":"machine_extracted","text":"Introduces a margin-adaptive confidence ranking method that learns an estimator from simulated diversity and derives margin-dependent generalization bounds for use in fixed-sequence testing of LLM-human agreement."},{"attestation":"unclaimed","claim_id":"C4","kind":"headline","source":"verdict.pith_extraction.headline","status":"machine_extracted","text":"A learned margin-adaptive confidence estimator improves LLM-human agreement by strengthening the link between confidence scores and disagreement risk."}],"snapshot_sha256":"0942f2d9bc1144f5aee3e295ef8d62f32d2da86018cb100d418aa44a464d5451"},"formal_canon":{"evidence_count":2,"snapshot_sha256":"6c06570b9a4bb9d4373c615b95f0d6dfac7f4a959f6e8b8690e486880383767e"},"integrity":{"available":true,"clean":true,"detectors_run":[{"findings_count":0,"name":"doi_title_agreement","ran_at":"2026-05-19T16:31:18.254382Z","status":"completed","version":"1.0.0"},{"findings_count":0,"name":"cited_work_retraction","ran_at":"2026-05-19T16:23:36.965665Z","status":"completed","version":"1.0.0"},{"findings_count":0,"name":"doi_compliance","ran_at":"2026-05-19T16:15:56.710719Z","status":"completed","version":"1.0.0"},{"findings_count":0,"name":"citation_quote_validity","ran_at":"2026-05-19T15:50:44.018287Z","status":"skipped","version":"0.1.0"},{"findings_count":0,"name":"claim_evidence","ran_at":"2026-05-19T14:21:54.146792Z","status":"completed","version":"1.0.0"},{"findings_count":0,"name":"ai_meta_artifact","ran_at":"2026-05-19T13:33:22.707235Z","status":"skipped","version":"1.0.0"}],"endpoint":"/pith/2605.15416/integrity.json","findings":[],"snapshot_sha256":"43871eea51c57cc73165a3d8b38150fc7add2c7ab1deed5fd564f70264d64277","summary":{"advisory":0,"by_detector":{},"critical":0,"informational":0}},"paper":{"abstract_excerpt":"Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ","authors_text":"Gaojie Jin, Lijia Yu, Tianjin Huang, Yong Tao","cross_cats":["cs.AI"],"headline":"A learned margin-adaptive confidence estimator improves LLM-human agreement by strengthening the link between confidence scores and disagreement risk.","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-14T21:01:05Z","title":"Margin-Adaptive Confidence Ranking for Reliable LLM Judgement"},"references":{"count":300,"internal_anchors":34,"resolved_work":300,"sample":[{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":1,"title":"Under review","work_id":"8617eae9-ad9f-45e0-bb23-49df5b429be6","year":null},{"cited_arxiv_id":"2204.05862","doi":"","is_internal_anchor":true,"ref_index":2,"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","year":null},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":3,"title":"Advances in neural information processing systems , volume=","work_id":"bbd406a3-5c71-400c-8be6-a6512f5ba309","year":null},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":4,"title":"Learning to summarize with human feedback , author=. NeurIPS , year=","work_id":"ae46b119-171c-4929-942d-42d764a0f2c4","year":null},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":5,"title":"Advances in Neural Information Processing Systems , volume=","work_id":"911dcd66-724a-45e1-8123-c49dd99fbff8","year":null}],"snapshot_sha256":"40d19acec3bca32faca7c2220403777971ad090d05d1a81cdd1eb10e0be904f4"},"source":{"id":"2605.15416","kind":"arxiv","version":1},"verdict":{"created_at":"2026-05-19T16:01:36.024585Z","id":"9f5ace5a-9521-4954-9c81-7bffe0b973b0","model_set":{"reader":"grok-4.3"},"one_line_summary":"Introduces a margin-adaptive confidence ranking method that learns an estimator from simulated diversity and derives margin-dependent generalization bounds for use in fixed-sequence testing of LLM-human agreement.","pipeline_version":"pith-pipeline@v0.9.0","pith_extraction_headline":"A learned margin-adaptive confidence estimator improves LLM-human agreement by strengthening the link between confidence scores and disagreement risk.","strongest_claim":"When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.","weakest_assumption":"That training on simulated annotator diversity produces a confidence estimator whose ranking behavior transfers to real human disagreement distributions; the abstract notes the original monotonicity assumption is often violated but does not quantify how well the simulation matches actual human variance."}},"verdict_id":"9f5ace5a-9521-4954-9c81-7bffe0b973b0"}}],"author_attestations":[],"timestamp_anchors":[],"storage_attestations":[],"citation_signatures":[],"replication_records":[],"corrections":[],"mirror_hints":[],"record_created":{"event_id":"sha256:0514979ca0fbfe25f36ab9748f12567ed7c88f9cead587426a9e71156671d3c6","target":"record","created_at":"2026-05-20T00:00:57Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"attestation_state":"computed","canonical_record":{"metadata":{"abstract_canon_sha256":"c15f1938ef9dde2ace8989fb774fd69b3b0fbe6d4a64d90a55ef4877104997bd","cross_cats_sorted":["cs.AI"],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2026-05-14T21:01:05Z","title_canon_sha256":"0a97817509d89b0754952f0409660f732f9fb2d7b2b5893010e11dfa0c0ee9db"},"schema_version":"1.0","source":{"id":"2605.15416","kind":"arxiv","version":1}},"canonical_sha256":"7cd88ae0bdfcdc8821f41d0d7d31029b9ae32276c77e3cce68d407239f3108b1","receipt":{"algorithm":"ed25519","builder_version":"pith-number-builder-2026-05-17-v1","canonical_sha256":"7cd88ae0bdfcdc8821f41d0d7d31029b9ae32276c77e3cce68d407239f3108b1","first_computed_at":"2026-05-20T00:00:57.468346Z","key_id":"pith-v1-2026-05","kind":"pith_receipt","last_reissued_at":"2026-05-20T00:00:57.468346Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","receipt_version":"0.3","signature_b64":"DyZTotFB9n2Zj7Ip5E3CRU2CCNHMaflurZcBmg0a8wPlbAYRMZp6rKCLvRyBqH3yn/KSvzbC7w+SnQvPPBAvAA==","signature_status":"signed_v1","signed_at":"2026-05-20T00:00:57.469262Z","signed_message":"canonical_sha256_bytes"},"source_id":"2605.15416","source_kind":"arxiv","source_version":1}}},"equivocations":[],"invalid_events":[],"applied_event_ids":["sha256:0514979ca0fbfe25f36ab9748f12567ed7c88f9cead587426a9e71156671d3c6","sha256:466cb6f43a171640c597a13772ba448569d42ed9c05fa4bc07e778b6e1039643"],"state_sha256":"cfd16504549c0bca9ef0affd26fee406ed42b2e4be3d64c669d2625de6c56beb"}