{"state_type":"pith_open_graph_state","state_version":"1.0","pith_number":"pith:2026:2B5PEQVXTKYX3R2TVN3PC6H5HL","merge_version":"pith-open-graph-merge-v1","event_count":2,"valid_event_count":2,"invalid_event_count":0,"equivocation_count":0,"current":{"canonical_record":{"metadata":{"abstract_canon_sha256":"37c3c24371f40bd0284ed88b67efea9857f9c52d33f4326886c00c20cf00248b","cross_cats_sorted":[],"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:09:25Z","title_canon_sha256":"6ce10d1245f7e0387396c35dd485891841c0774ad727828724568f537aedb970"},"schema_version":"1.0","source":{"id":"2605.14186","kind":"arxiv","version":1}},"source_aliases":[{"alias_kind":"arxiv","alias_value":"2605.14186","created_at":"2026-05-17T23:39:11Z"},{"alias_kind":"arxiv_version","alias_value":"2605.14186v1","created_at":"2026-05-17T23:39:11Z"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.14186","created_at":"2026-05-17T23:39:11Z"},{"alias_kind":"pith_short_12","alias_value":"2B5PEQVXTKYX","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_16","alias_value":"2B5PEQVXTKYX3R2T","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_8","alias_value":"2B5PEQVX","created_at":"2026-05-18T12:33:37Z"}],"graph_snapshots":[{"event_id":"sha256:677148e7cd1d907d4f284d0a50e0782be827e64494bedf5c487e5804c2ec7d20","target":"graph","created_at":"2026-05-17T23:39:11Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"graph_snapshot":{"author_claims":{"count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","strong_count":0},"builder_version":"pith-number-builder-2026-05-17-v1","claims":{"count":4,"items":[{"attestation":"unclaimed","claim_id":"C1","kind":"strongest_claim","source":"verdict.strongest_claim","status":"machine_extracted","text":"Across text, code, and multimodal reasoning benchmarks, our harness substantially improves a fixed Claude Sonnet-4.6 base model without parameter updates or benchmark-specific fine-tuning. On the evaluated public benchmark snapshots, it raises pooled accuracy from 48.3 to 56.9 and exceeds the strongest listed leaderboard entries on the three primary evaluation settings: HLE-Verified, LiveCodeBench v6, and R-Bench-V."},{"attestation":"unclaimed","claim_id":"C2","kind":"weakest_assumption","source":"verdict.weakest_assumption","status":"machine_extracted","text":"That the pre-solve feeling-of-knowing and post-solve judgment-of-learning signals elicited from the LLM are sufficiently reliable, consistent, and actionable to serve as effective control inputs for trust/retry/aggregate decisions without introducing systematic bias or new failure modes."},{"attestation":"unclaimed","claim_id":"C3","kind":"one_line_summary","source":"verdict.one_line_summary","status":"machine_extracted","text":"A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks."},{"attestation":"unclaimed","claim_id":"C4","kind":"headline","source":"verdict.pith_extraction.headline","status":"machine_extracted","text":"Large language models can use their own pre- and post-solution self-assessments to control inference and raise accuracy on reasoning tasks without any training or fine-tuning."}],"snapshot_sha256":"cc627bc89baed6e31bc258274f5a6ae2738e4e6f747d49229a9aad6f53411097"},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ebec385d37de0b619bbb5494c5c5e8d53b93fb63232467c34813fc672c342186"},"paper":{"abstract_excerpt":"Large language models (LLMs) often expose useful signals of self-monitoring: before solving a problem, they can estimate whether they are likely to succeed, and after solving it, they can judge whether their answer is likely to be correct. However, these signals are typically measured or elicited in isolation, rather than used to control inference. In this work, we ask whether LLMs possess latent metacognitive ability that can be turned into effective test-time control. Inspired by the Nelson--Narens theory from cognitive psychology, we propose a metacognitive harness that separates monitoring","authors_text":"Peijia Qin, Pengtao Xie, Qi Cao, Shuhao Zhang, Yufan Wang","cross_cats":[],"headline":"Large language models can use their own pre- and post-solution self-assessments to control inference and raise accuracy on reasoning tasks without any training or fine-tuning.","license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:09:25Z","title":"LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling"},"references":{"count":61,"internal_anchors":11,"resolved_work":61,"sample":[{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":1,"title":"Harness engineering: Leveraging codex in an agent-first world","work_id":"73e88b1e-01de-4277-a848-7c7a3d0361ae","year":2026},{"cited_arxiv_id":"2303.08774","doi":"","is_internal_anchor":true,"ref_index":2,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","year":2023},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":3,"title":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances","work_id":"c4a52ad0-3036-4eb4-bbde-41fa3ef37131","year":2022},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":4,"title":"Scaling llm test-time compute optimally can be more effective than scaling model parameters for reasoning","work_id":"78506b11-0537-4d98-922a-fa1f132f98dc","year":2025},{"cited_arxiv_id":"2308.12950","doi":"","is_internal_anchor":true,"ref_index":5,"title":"Code Llama: Open Foundation Models for Code","work_id":"e73bffa4-7620-47ac-9327-259a60db52ca","year":2023}],"snapshot_sha256":"b4b68dc1e7b3d64971a94acca001cdd77062bc9cf7defb2581a03d57e6e12bb7"},"source":{"id":"2605.14186","kind":"arxiv","version":1},"verdict":{"created_at":"2026-05-15T04:46:21.059145Z","id":"53c16236-a877-47f2-a614-071a71ddbcf7","model_set":{"reader":"grok-4.3"},"one_line_summary":"A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","pith_extraction_headline":"Large language models can use their own pre- and post-solution self-assessments to control inference and raise accuracy on reasoning tasks without any training or fine-tuning.","strongest_claim":"Across text, code, and multimodal reasoning benchmarks, our harness substantially improves a fixed Claude Sonnet-4.6 base model without parameter updates or benchmark-specific fine-tuning. On the evaluated public benchmark snapshots, it raises pooled accuracy from 48.3 to 56.9 and exceeds the strongest listed leaderboard entries on the three primary evaluation settings: HLE-Verified, LiveCodeBench v6, and R-Bench-V.","weakest_assumption":"That the pre-solve feeling-of-knowing and post-solve judgment-of-learning signals elicited from the LLM are sufficiently reliable, consistent, and actionable to serve as effective control inputs for trust/retry/aggregate decisions without introducing systematic bias or new failure modes."}},"verdict_id":"53c16236-a877-47f2-a614-071a71ddbcf7"}}],"author_attestations":[],"timestamp_anchors":[],"storage_attestations":[],"citation_signatures":[],"replication_records":[],"corrections":[],"mirror_hints":[],"record_created":{"event_id":"sha256:ff0b249682437a2949c5ebb5ba4e4f8427720492cdc055119a6696c12f4f1418","target":"record","created_at":"2026-05-17T23:39:11Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"attestation_state":"computed","canonical_record":{"metadata":{"abstract_canon_sha256":"37c3c24371f40bd0284ed88b67efea9857f9c52d33f4326886c00c20cf00248b","cross_cats_sorted":[],"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2026-05-13T23:09:25Z","title_canon_sha256":"6ce10d1245f7e0387396c35dd485891841c0774ad727828724568f537aedb970"},"schema_version":"1.0","source":{"id":"2605.14186","kind":"arxiv","version":1}},"canonical_sha256":"d07af242b79ab17dc753ab76f178fd3afb2892ecd26659597bfd6e4bacd2043a","receipt":{"algorithm":"ed25519","builder_version":"pith-number-builder-2026-05-17-v1","canonical_sha256":"d07af242b79ab17dc753ab76f178fd3afb2892ecd26659597bfd6e4bacd2043a","first_computed_at":"2026-05-17T23:39:11.197499Z","key_id":"pith-v1-2026-05","kind":"pith_receipt","last_reissued_at":"2026-05-17T23:39:11.197499Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","receipt_version":"0.3","signature_b64":"QQYS60aRkCw2g4jMpzQ1NS3Ravj9KiCd0a22lcuE2JXDbG9EtEsSkaW+lbxMfmIOR6/cnBf3JOk6wh8HW5+cBQ==","signature_status":"signed_v1","signed_at":"2026-05-17T23:39:11.198030Z","signed_message":"canonical_sha256_bytes"},"source_id":"2605.14186","source_kind":"arxiv","source_version":1}}},"equivocations":[],"invalid_events":[],"applied_event_ids":["sha256:ff0b249682437a2949c5ebb5ba4e4f8427720492cdc055119a6696c12f4f1418","sha256:677148e7cd1d907d4f284d0a50e0782be827e64494bedf5c487e5804c2ec7d20"],"state_sha256":"d8451e7f4851d66baebe17e00ff8211c4a30b57033bcd6b112b7c86d0139f246"}