{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:TJFQBEJHC7IMMUTPGZXCOF3QQS","short_pith_number":"pith:TJFQBEJH","schema_version":"1.0","canonical_sha256":"9a4b00912717d0c6526f366e27177084b3bf21f578d87cd75eaa3470398c788b","source":{"kind":"arxiv","id":"2505.05410","version":1},"attestation_state":"computed","paper":{"title":"Reasoning Models Don't Always Say What They Think","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Chain-of-thought reasoning often fails to disclose when models use provided hints.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Ansh Radhakrishnan, Arushi Somani, Carson Denison, Ethan Perez, Fabien Roger, Jan Leike, Jared Kaplan, Joe Benton, John Schulman, Jonathan Uesato, Misha Wagner, Peter Hase, Samuel R. Bowman, Vlad Mikulik, Yanda Chen","submitted_at":"2025-05-08T16:51:43Z","abstract_excerpt":"Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. 
We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforce"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":false},"canonical_record":{"source":{"id":"2505.05410","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2025-05-08T16:51:43Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"cd026e0c39c1ba6ee5afbc1fab9ffe1c6ad98fe23b26b3277f37b1cf52f8b6d4","abstract_canon_sha256":"0e8fc87ee1108d5e64c69b0654c60b56182aeecb70b65ad2fd894f411a7e3db3"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:21.845905Z","signature_b64":"OGvcRgeRE/71Pg+317U+mZgXJggUC/jm7EJ3vScDX5j8phEelgMcBTSCzkSGlqVPuQDx5rYNlF9pLS24hCj2CQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"9a4b00912717d0c6526f366e27177084b3bf21f578d87cd75eaa3470398c788b","last_reissued_at":"2026-05-17T23:39:21.845259Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:21.845259Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Reasoning Models Don't Always Say What They Think","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Chain-of-thought reasoning often fails to disclose when models use provided 
hints.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Ansh Radhakrishnan, Arushi Somani, Carson Denison, Ethan Perez, Fabien Roger, Jan Leike, Jared Kaplan, Joe Benton, John Schulman, Jonathan Uesato, Misha Wagner, Peter Hase, Samuel R. Bowman, Vlad Mikulik, Yanda Chen","submitted_at":"2025-05-08T16:51:43Z","abstract_excerpt":"Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforce"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"For most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%. Outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating. 
When reinforcement learning increases how frequently hints are used, the propensity to verbalize them does not increase.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That differences in model performance with and without hints reliably indicate whether the model is actually using the hint in its internal reasoning, and that the chosen hints and tasks create conditions where faithful CoT should mention the hint if used.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Chain-of-thought outputs in reasoning models frequently fail to disclose their use of provided hints, even after reinforcement learning, limiting the reliability of CoT monitoring for safety.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Chain-of-thought reasoning often fails to disclose when models use provided hints.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"cfa9249155825eea71082c7dce0466528b0166d1fbdd048fa2663ded17d46b2d"},"source":{"id":"2505.05410","kind":"arxiv","version":1},"verdict":{"id":"cde6feda-9715-4d72-b4f6-f414b338705e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:14:11.315137Z","strongest_claim":"For most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%. Outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating. 
When reinforcement learning increases how frequently hints are used, the propensity to verbalize them does not increase.","one_line_summary":"Chain-of-thought outputs in reasoning models frequently fail to disclose their use of provided hints, even after reinforcement learning, limiting the reliability of CoT monitoring for safety.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That differences in model performance with and without hints reliably indicate whether the model is actually using the hint in its internal reasoning, and that the chosen hints and tasks create conditions where faithful CoT should mention the hint if used.","pith_extraction_headline":"Chain-of-thought reasoning often fails to disclose when models use provided hints."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2505.05410","created_at":"2026-05-17T23:39:21.845358+00:00"},{"alias_kind":"arxiv_version","alias_value":"2505.05410v1","created_at":"2026-05-17T23:39:21.845358+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2505.05410","created_at":"2026-05-17T23:39:21.845358+00:00"},{"alias_kind":"pith_short_12","alias_value":"TJFQBEJHC7IM","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"TJFQBEJHC7IMMUTP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"TJFQBEJH","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sa
mple":[{"citing_arxiv_id":"2506.22832","citing_title":"Listener-Rewarded Thinking in VLMs for Image Preferences","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2509.21361","citing_title":"Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18814","citing_title":"A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21931","citing_title":"Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2601.05300","citing_title":"TIME: Temporally Intelligent Meta-reasoning Engine for Context-Triggered Explicit Reasoning","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2602.23163","citing_title":"A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2506.06941","citing_title":"The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14049","citing_title":"Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14415","citing_title":"SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades","ref_index":86,"is_internal_anchor":true},{"citing_arxiv_id":"2603.22816","citing_title":"Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12673","citing_title":"Do Androids Dream of Breaking the Game? 
Systematically Auditing AI Agent Benchmarks with BenchJack","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12746","citing_title":"CoT-Guard: Small Models for Strong Monitoring","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13290","citing_title":"What properties of reasoning supervision are associated with improved downstream model quality?","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11746","citing_title":"When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12460","citing_title":"Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11467","citing_title":"Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27960","citing_title":"LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10930","citing_title":"Evaluating the False Trust engendered by LLM Explanations","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09716","citing_title":"Medical Model Synthesis Architectures: A Case Study","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10601","citing_title":"The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09519","citing_title":"Weighted Rules under the Stable Model Semantics","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25110","citing_title":"Knowledge Distillation Must Account for What It Loses","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24966","citing_title":"Risk Reporting for Developers' Internal AI 
Model Use","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24178","citing_title":"Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS","json":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS.json","graph_json":"https://pith.science/api/pith-number/TJFQBEJHC7IMMUTPGZXCOF3QQS/graph.json","events_json":"https://pith.science/api/pith-number/TJFQBEJHC7IMMUTPGZXCOF3QQS/events.json","paper":"https://pith.science/paper/TJFQBEJH"},"agent_actions":{"view_html":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS","download_json":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS.json","view_paper":"https://pith.science/paper/TJFQBEJH","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2505.05410&json=true","fetch_graph":"https://pith.science/api/pith-number/TJFQBEJHC7IMMUTPGZXCOF3QQS/graph.json","fetch_events":"https://pith.science/api/pith-number/TJFQBEJHC7IMMUTPGZXCOF3QQS/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS/action/timestamp_anchor","attest_storage":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS/action/storage_attestation","attest_author":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS/action/author_attestation","sign_citation":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS/action/citation_signature","submit_replication":"https://pith.science/pith/TJFQBEJHC7IMMUTPGZXCOF3QQS/action/replication_record"}},"created_at":"2026-05-17T23:39:21.845358+00:00","updated_at":"2026-05-17T23:39:21.845358+00:00"}