{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:XWKVJIXSLVYNKHWXKRVTXQIOU2","short_pith_number":"pith:XWKVJIXS","schema_version":"1.0","canonical_sha256":"bd9554a2f25d70d51ed7546b3bc10ea6987dd4cbd948aa53f779b964a512b7c5","source":{"kind":"arxiv","id":"2506.06941","version":3},"attestation_state":"computed","paper":{"title":"The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Large Reasoning Models exhibit complete accuracy collapse beyond certain complexities and reduce reasoning effort despite available compute.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.AI","authors_text":"Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Mehrdad Farajtabar, Parshin Shojaee, Samy Bengio","submitted_at":"2025-06-07T22:42:29Z","abstract_excerpt":"Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2506.06941","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.AI","submitted_at":"2025-06-07T22:42:29Z","cross_cats_sorted":["cs.CL","cs.LG"],"title_canon_sha256":"a0d32bd599754e05eb9948d06ed7aed1b2cdac8f3f64203a8c1b4e2a57a86a6c","abstract_canon_sha256":"8a45099f14d045accff594ca13ca08c77d46017efad9a353a561b48d2641f330"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.946361Z","signature_b64":"QJojdnMraqS7GGRC/bq8u8DOQyIO2OmylCegjVkwUC5GH0BKZTfQWreUP6tkesPRiMA1gcZJ2fH5SrnJYvReDA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"bd9554a2f25d70d51ed7546b3bc10ea6987dd4cbd948aa53f779b964a512b7c5","last_reissued_at":"2026-05-17T23:38:50.945787Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.945787Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Large Reasoning Models exhibit complete accuracy collapse beyond certain complexities and reduce reasoning effort despite available compute.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.AI","authors_text":"Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Mehrdad Farajtabar, Parshin Shojaee, Samy Bengio","submitted_at":"2025-06-07T22:42:29Z","abstract_excerpt":"Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen controllable puzzle environments provide an unbiased and generalizable measure of reasoning complexity without introducing artifacts that do not appear in other domains such as math or coding.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large Reasoning Models exhibit complete accuracy collapse beyond certain complexities and reduce reasoning effort despite available compute.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6013975e7a8b629077629637ac402effe0523f2865e275237dfbdd1c418085d1"},"source":{"id":"2506.06941","kind":"arxiv","version":3},"verdict":{"id":"6892fb93-a504-4e34-a60f-6fe793f4beb0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T16:04:32.572571Z","strongest_claim":"LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget.","one_line_summary":"LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen controllable puzzle environments provide an unbiased and generalizable measure of reasoning complexity without introducing artifacts that do not appear in other domains such as math or coding.","pith_extraction_headline":"Large Reasoning Models exhibit complete accuracy collapse beyond certain complexities and reduce reasoning effort despite available compute."},"references":{"count":55,"sample":[{"doi":"","year":2024,"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","ref_index":1,"cited_arxiv_id":"2412.16720","is_internal_anchor":true},{"doi":"","year":2024,"title":"Introducing openai o1","work_id":"5497958f-45dd-499a-b0c6-ccc8449e45bc","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":3,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2025,"title":"Claude 3.7 sonnet","work_id":"defe6220-1f5a-4678-923b-c0c9c0ef2c4a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Gemini flash thinking.Google AI Blog, Jan 2025","work_id":"94b01b54-7289-4d34-b163-52dcb02e770c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":55,"snapshot_sha256":"f1c8446e766d84e24ffbfbd4a9e5adc5f8c9931d9f64ed559c9257ae17eaa78a","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"62e8360a3887101bdb96469aab1b850550b05d0db1bd52b61d4215d15ae846ce"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2506.06941","created_at":"2026-05-17T23:38:50.945875+00:00"},{"alias_kind":"arxiv_version","alias_value":"2506.06941v3","created_at":"2026-05-17T23:38:50.945875+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2506.06941","created_at":"2026-05-17T23:38:50.945875+00:00"},{"alias_kind":"pith_short_12","alias_value":"XWKVJIXSLVYN","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"XWKVJIXSLVYNKHWX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"XWKVJIXS","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2605.23007","citing_title":"MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08571","citing_title":"Robust Reasoning Benchmark","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23108","citing_title":"Artificial Phantasia: Emergent Mental Imagery in Large Language Models","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2510.26745","citing_title":"Deep sequence models tend to memorize geometrically; it is unclear why","ref_index":164,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18814","citing_title":"A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18302","citing_title":"What Would GPT Click: Practical Effects of Human-AI Behavioral Misalignment and the Cost of Synthetic Participants in User Experience","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2508.01191","citing_title":"Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2508.04691","citing_title":"Before Humans Join the Team: Diagnosing Coordination Failures in Healthcare Robot Team Simulation","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2508.16745","citing_title":"Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2509.21882","citing_title":"Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18184","citing_title":"ActivationReasoning: Logical Reasoning in Latent Activation Spaces","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2510.18814","citing_title":"A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2510.25426","citing_title":"Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09629","citing_title":"End-to-end PDDL Planning with Hardcoded and Dynamic Agents","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2601.19924","citing_title":"OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2603.06870","citing_title":"LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09634","citing_title":"From Understanding to Creation: A Prerequisite-Free AI Literacy Course with Technical Depth Across Majors","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14262","citing_title":"Distill: Uncovering the True Intent behind Human-Robot Communication","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2603.22816","citing_title":"Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08571","citing_title":"Robust Reasoning Benchmark","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27343","citing_title":"WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12421","citing_title":"Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09678","citing_title":"Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09519","citing_title":"Weighted Rules under the Stable Model Semantics","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25506","citing_title":"Assistants, Not Architects: The Role of LLMs in Networked Systems Design","ref_index":49,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2","json":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2.json","graph_json":"https://pith.science/api/pith-number/XWKVJIXSLVYNKHWXKRVTXQIOU2/graph.json","events_json":"https://pith.science/api/pith-number/XWKVJIXSLVYNKHWXKRVTXQIOU2/events.json","paper":"https://pith.science/paper/XWKVJIXS"},"agent_actions":{"view_html":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2","download_json":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2.json","view_paper":"https://pith.science/paper/XWKVJIXS","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2506.06941&json=true","fetch_graph":"https://pith.science/api/pith-number/XWKVJIXSLVYNKHWXKRVTXQIOU2/graph.json","fetch_events":"https://pith.science/api/pith-number/XWKVJIXSLVYNKHWXKRVTXQIOU2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2/action/storage_attestation","attest_author":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2/action/author_attestation","sign_citation":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2/action/citation_signature","submit_replication":"https://pith.science/pith/XWKVJIXSLVYNKHWXKRVTXQIOU2/action/replication_record"}},"created_at":"2026-05-17T23:38:50.945875+00:00","updated_at":"2026-05-17T23:38:50.945875+00:00"}