{"paper":{"title":"Stateful Reasoning via Insight Replay","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Replaying critical insights from earlier in a reasoning trace keeps them accessible and improves accuracy as chains lengthen in large language models.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Ang Li, Bin Lei, Caiwen Ding, Jiachen Yang, Xin Eric Wang","submitted_at":"2026-05-14T06:52:59Z","abstract_excerpt":"Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they a"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"3-round InsightReplay yields accuracy gains across all 24 settings, with an averaged improvement of +1.65 points over standard CoT, and a largest single-setting gain of +9.2 points on R1-Distill-32B's LiveCodeBench v5 subset.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the model can reliably extract truly critical insights without introducing noise or errors, and that replaying them will restore accessibility without disrupting the ongoing reasoning flow.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"InsightReplay improves LLM accuracy on reasoning benchmarks by extracting and replaying critical insights to maintain their accessibility during extended chain-of-thought generation.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Replaying critical insights from earlier in a reasoning trace keeps them accessible and improves accuracy as chains lengthen in large language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5d616a79046304b9c8d7c649f194be86cd1ec7d3ab6967f11dd5a681a3d99935"},"source":{"id":"2605.14457","kind":"arxiv","version":1},"verdict":{"id":"a18e5c72-95d6-4958-95ba-58d245d5811d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T01:37:03.082135Z","strongest_claim":"3-round InsightReplay yields accuracy gains across all 24 settings, with an averaged improvement of +1.65 points over standard CoT, and a largest single-setting gain of +9.2 points on R1-Distill-32B's LiveCodeBench v5 subset.","one_line_summary":"InsightReplay improves LLM accuracy on reasoning benchmarks by extracting and replaying critical insights to maintain their accessibility during extended chain-of-thought generation.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the model can reliably extract truly critical insights without introducing noise or errors, and that replaying them will restore accessibility without disrupting the ongoing reasoning flow.","pith_extraction_headline":"Replaying critical insights from earlier in a reasoning trace keeps them accessible and improves accuracy as chains lengthen in large language models."},"references":{"count":37,"sample":[{"doi":"","year":2022,"title":"Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837","work_id":"3256b5a6-76d9-460c-ad77-5d232058ad6d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Self-consistency improves chain of thought reasoning in language models","work_id":"bc8f4e8c-9c0a-4b85-8532-6489ef6b2d1c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638","work_id":"f01253de-58e9-4b0f-a91e-4ef797517a24","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213","work_id":"f69f3f9c-f092-49b0-b244-16b6480f5a02","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"When more is less: Understanding chain-of-thought length in llms","work_id":"450a518d-6378-4944-9651-8e3ef8049ddf","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":37,"snapshot_sha256":"7fe9431531b7d685697d2fb2b44b6a06ea83eafa258a7cdc5f8e4d20a81f320b","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ffc630b03d1266ec44cbb25299b9d66d1fd6dc53e47df7df2f2e81917d3e240e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}