Stateful Reasoning via Insight Replay

Ang Li; Bin Lei; Caiwen Ding; Jiachen Yang; Xin Eric Wang

arxiv: 2605.14457 · v2 · pith:EWJ3PILCnew · submitted 2026-05-14 · 💻 cs.AI

Bin Lei, Caiwen Ding, Jiachen Yang, Ang Li, Xin Eric Wang

Pith reviewed 2026-05-20 21:32 UTC · model grok-4.3

classification 💻 cs.AI

keywords chain-of-thought · reasoning · insight replay · large language models · test-time scaling · attention · stateful reasoning


The pith

Periodically extracting and replaying critical insights keeps them accessible in long reasoning traces and raises accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Chain-of-thought reasoning loses effectiveness once traces grow long because models gradually lose attention to key insights generated early on. The paper proposes InsightReplay as a fix: the model periodically pulls out those insights and replays them near the current point in its generation. Experiments on a grid of two model scales, three families, and four benchmarks show gains in every one of the 24 settings, averaging 1.65 points and reaching 9.2 points in the best case. A sympathetic reader would care because the result reframes test-time scaling as an access problem rather than a pure length problem.

Core claim

As chain-of-thought length increases, attention to critical early insights weakens and accuracy eventually declines. InsightReplay counters this by having the model extract those insights at intervals and replay them near the active generation frontier so they remain accessible for later steps. The approach produces accuracy gains across all 24 tested combinations of models and tasks.

What carries the argument

InsightReplay, a stateful method in which the model periodically extracts critical insights from its growing reasoning trace and replays them near the current generation frontier.
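The extract-and-replay loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `extract` are hypothetical callables standing in for model calls, the "Key insights so far" wording is invented for the example, and the schedule (one replay after each of three reasoning segments, then a final segment) is only one plausible reading of "3-round".

```python
from typing import Callable, List


def insight_replay(
    problem: str,
    generate: Callable[[str], str],       # hypothetical: continues the trace
    extract: Callable[[str], List[str]],  # hypothetical: returns key insights
    rounds: int = 3,
) -> str:
    """Alternate reasoning segments with insight replay; return the full trace."""
    trace = problem
    for _ in range(rounds):
        trace += "\n" + generate(trace)   # extend the reasoning trace
        insights = extract(trace)         # pull out insights found so far
        if insights:
            # Re-state the insights at the end of the trace, i.e. next to
            # the point where generation resumes.
            bullet_list = "\n".join(f"- {s}" for s in insights)
            trace += "\nKey insights so far:\n" + bullet_list
    trace += "\n" + generate(trace)       # final segment that produces the answer
    return trace


# Toy stand-ins so the sketch runs without a model.
def toy_generate(context: str) -> str:
    return f"step written after {len(context.splitlines())} lines"


def toy_extract(trace: str) -> List[str]:
    steps = [ln for ln in trace.splitlines() if ln.startswith("step")]
    return steps[-2:]  # keep only the two most recent steps


trace = insight_replay("Problem: toy example", toy_generate, toy_extract)
print(trace)
```

Passing the model calls in as plain callables keeps the control flow visible; in a real system the extraction prompt, the replay format, and the trigger for each round would all be design choices the sketch does not settle.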

If this is right

Test-time scaling benefits depend on preserving access to intermediate insights, not merely extending trace length.
The non-monotonic accuracy pattern with longer CoT can be mitigated by periodic replay.
Gains appear in both scale tiers (8B and 30B) and across math (AIME, HMMT), science (GPQA Diamond), and coding (LiveCodeBench v5) benchmarks.
Three rounds of insight extraction and replay suffice to produce measurable improvements in the reported settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same replay idea could be applied to other long-context generation tasks where early decisions affect later accuracy.
If the extraction step is made more robust, for example by cross-checking extracted insights, further gains may be possible beyond the 3-round schedule tested.
The method might reduce the total compute needed for a given accuracy target by making shorter traces more effective.

Load-bearing premise

The model can reliably extract only the truly critical insights from its own trace without adding noise or omitting load-bearing facts.

What would settle it

Measure accuracy after replacing the extracted insights with random or irrelevant statements from the same trace. If the gains persist under this control, the claim that specific critical insights drive the improvement would be falsified; if the gains disappear, the claim is supported.
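One way to build the random-statement control described above is sketched here. It is an illustrative assumption rather than a protocol from the paper: the sentence splitter is a simple regular expression, and the helper names `split_statements` and `random_control` are hypothetical.

```python
import random
import re
from typing import List


def split_statements(trace: str) -> List[str]:
    """Split a trace into sentence-like statements (naive punctuation rule)."""
    parts = re.split(r"(?<=[.!?])\s+", trace.strip())
    return [p for p in parts if p]


def random_control(trace: str, extracted: List[str], seed: int = 0) -> List[str]:
    """Sample as many statements as were extracted, excluding the extracted ones."""
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    pool = [s for s in split_statements(trace) if s not in extracted]
    k = min(len(extracted), len(pool))
    return rng.sample(pool, k)


toy_trace = "A is 1. B is 2. C is 3. D is 4."
toy_extracted = ["B is 2."]
control = random_control(toy_trace, toy_extracted)
print(control)
```

Matching the number of replayed statements keeps the control close in token count to the real replay, so a remaining accuracy gap is harder to attribute to added length alone.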

Original abstract

Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. Therefore, we propose InsightReplay, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a 2×3×4 benchmark grid, covering model scales {8B, 30B}, model families {Qwen3.5, DeepSeek-R1-Distill-Qwen, Gemma-4}, and reasoning benchmarks {AIME, HMMT, GPQA Diamond, LiveCodeBench v5}, show that 3-round InsightReplay yields accuracy gains across all 24 settings, with an averaged improvement of +1.65 points over standard CoT, and a largest single-setting gain of +9.2 points on R1-Distill-32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test-time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies weakening attention to early critical insights as a key limitation in long Chain-of-Thought traces, proposes InsightReplay to periodically extract and replay those insights near the active frontier, and reports that three rounds of this procedure produce accuracy gains over standard CoT in every one of 24 experimental settings (model scales 8B/30B, three families, four benchmarks), with an average improvement of +1.65 points and a peak gain of +9.2 points.

Significance. Should the gains prove attributable to the replay of verified critical insights rather than prompting overhead or token count, the work would offer a practical, stateful mechanism for sustaining access to early reasoning steps during extended test-time scaling. The breadth of the 2×3×4 grid supplies unusually wide empirical coverage for a single method.

major comments (2)

[Section 3] Section 3 (InsightReplay method): the central explanatory claim—that periodic extraction and replay mitigates attention decay—requires that the extracted statements are both accurate and load-bearing. No ablation, human rating, or comparison against random or generic extraction is reported, so the +1.65 average gain cannot yet be confidently attributed to the proposed mechanism rather than to the additional extraction prompts or altered generation length.
[Section 4] Section 4 (Experiments and Table 1): accuracy deltas are presented without error bars, standard deviations, or multi-seed statistics. For the smaller gains that constitute most of the 24 settings, it is therefore impossible to assess whether the reported improvements are statistically reliable or could be explained by run-to-run variance.

minor comments (1)

[Abstract] The abstract states a 2×3×4 grid yielding 24 settings but does not enumerate the exact combination of model sizes, families, and benchmarks; an explicit listing or reference to the corresponding table would improve clarity.
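The explicit listing requested above can be generated from the three sets named in the abstract. The sketch below only enumerates labels; the scale labels are the nominal tiers from the abstract (which also mentions R1-Distill-32B under the 30B tier), so the exact checkpoint for each cell is an assumption left unresolved here.

```python
from itertools import product

# Labels copied from the abstract; exact checkpoints per cell are not specified here.
scales = ["8B", "30B"]
families = ["Qwen3.5", "DeepSeek-R1-Distill-Qwen", "Gemma-4"]
benchmarks = ["AIME", "HMMT", "GPQA Diamond", "LiveCodeBench v5"]

# Cartesian product: every (scale, family, benchmark) combination once.
settings = list(product(scales, families, benchmarks))

for scale, family, benchmark in settings:
    print(f"{family} ({scale}) on {benchmark}")

print(len(settings))  # 24
```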

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where stronger evidence for mechanism attribution and statistical robustness would improve the work. We respond to each major comment below and will incorporate revisions to address them.

Point-by-point responses

Referee: [Section 3] Section 3 (InsightReplay method): the central explanatory claim—that periodic extraction and replay mitigates attention decay—requires that the extracted statements are both accurate and load-bearing. No ablation, human rating, or comparison against random or generic extraction is reported, so the +1.65 average gain cannot yet be confidently attributed to the proposed mechanism rather than to the additional extraction prompts or altered generation length.

Authors: We agree that the current experiments do not fully isolate the contribution of replaying load-bearing critical insights from possible effects of extra prompting or token budget. The uniform gains across all 24 settings make a pure overhead explanation unlikely, yet direct controls are needed. In the revised manuscript we will add an ablation replacing extracted insights with randomly sampled statements from the same trace and a second control that injects equivalent additional tokens via generic prompts without insight extraction. We will also report human ratings of accuracy and relevance for a sample of extracted insights. These results will appear in a new subsection of Section 3 and an expanded appendix.

Revision: yes

Referee: [Section 4] Section 4 (Experiments and Table 1): accuracy deltas are presented without error bars, standard deviations, or multi-seed statistics. For the smaller gains that constitute most of the 24 settings, it is therefore impossible to assess whether the reported improvements are statistically reliable or could be explained by run-to-run variance.

Authors: We acknowledge that the lack of variance estimates limits evaluation of the smaller deltas. To remedy this we have rerun all 24 settings with three independent random seeds and will replace the single-run numbers in Table 1 with means and standard deviations. The updated table will also note the seed count, allowing readers to judge whether observed improvements exceed typical generation variance. Larger gains (e.g., +9.2) remain clearly outside the range of run-to-run fluctuation even under the new statistics.

Revision: yes
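The multi-seed reporting discussed in this exchange amounts to a mean and standard deviation per setting. A minimal sketch follows, assuming three seeds per condition and runs paired by seed; the accuracy values are made-up placeholders for illustration and are not results from the paper, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize(baseline: Sequence[float], replay: Sequence[float]) -> Dict[str, float]:
    """Per-setting summary of paired runs (same seed order in both lists)."""
    if len(baseline) != len(replay) or len(baseline) < 2:
        raise ValueError("need at least two paired runs")
    deltas = [r - b for b, r in zip(baseline, replay)]
    return {
        "baseline_mean": mean(baseline),
        "replay_mean": mean(replay),
        "delta_mean": mean(deltas),
        "delta_std": stdev(deltas),  # sample standard deviation (n - 1)
    }


# Placeholder accuracies for one setting, three seeds each (NOT paper data).
baseline_runs = [60.0, 62.0, 61.0]
replay_runs = [62.0, 63.0, 63.0]

stats = summarize(baseline_runs, replay_runs)
print(stats)
```

With only three seeds the standard deviation is itself a rough estimate, so a delta that is small relative to `delta_std` should be read as inconclusive rather than as a confirmed gain.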

Circularity Check

0 steps flagged

Empirical evaluation of InsightReplay is self-contained with no circular derivation

Full rationale

The paper proposes InsightReplay to address attention decay in long CoT traces and validates it via direct experiments on a 2×3×4 grid of models and benchmarks. The central claims consist of measured accuracy deltas (+1.65 average, up to +9.2) against standard CoT on held-out data. No equations, fitted parameters, or self-citations are invoked as load-bearing steps that reduce the result to its own inputs by construction. The method description and experimental protocol stand independently of any prior self-work.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that LLMs can perform reliable self-extraction of insights and that replaying them improves attention without side effects; no new physical or mathematical entities are introduced.

free parameters (1)

number of replay rounds
Fixed at 3 in the reported experiments; chosen to balance overhead and benefit.

axioms (1)

Domain assumption: LLMs can accurately identify critical insights from their own partial reasoning trace.
Invoked when the model is prompted to extract insights periodically.

pith-pipeline@v0.9.0 · 5871 in / 1291 out tokens · 56199 ms · 2026-05-20T21:32:48.744903+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 5 internal anchors

[1]

Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[2]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023

work page 2023
[3]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

work page 2025
[4]

Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

work page 2022
[5]

When more is less: Understanding chain-of-thought length in llms. arXiv preprint arXiv:2502.07266, 2025

Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, and Yisen Wang. When more is less: Understanding chain-of-thought length in llms. arXiv preprint arXiv:2502.07266, 2025

work page arXiv 2025
[6]

Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens

Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, and Yu Meng. Think deep, not just long: Measuring llm reasoning effort via deep-thinking tokens. arXiv preprint arXiv:2602.13517, 2026

work page arXiv 2026
[7]

Inverse scaling in test-time compute. arXiv preprint arXiv:2507.14417, 2025

Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, et al. Inverse scaling in test-time compute. arXiv preprint arXiv:2507.14417, 2025

work page arXiv 2025
[8]

Thought anchors: Which llm reasoning steps matter?

Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which llm reasoning steps matter?, 2025. URL https://arxiv.org/abs/2506.19143

work page arXiv 2025
[9]

Long short-term memory. Neural computation, 9(8):1735–1780, 1997

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

work page 1997
[10]

Tokenskip: Controllable chain-of-thought compression in llms

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3351–3363, 2025

work page 2025
[11]

C3ot: Generating shorter chain-of-thought without compromising effectiveness

Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24312–24320, 2025

work page 2025
[12]

Not all tokens are what you need in thinking

Hang Yuan, Bin Yu, Haotian Li, Shijun Yang, Christina Dan Wang, Zhou Yu, Xueyin Xu, Weizhen Qi, and Kai Chen. Not all tokens are what you need in thinking. arXiv preprint arXiv:2505.17827, 2025

work page arXiv 2025
[13]

Making slow thinking faster: Compressing llm chain-of-thought via step entropy

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, and Qiang Xu. Making slow thinking faster: Compressing llm chain-of-thought via step entropy. In The Fourteenth International Conference on Learning Representations, 2026

work page 2026
[14]

Inftythink: Breaking the length limits of long-context reasoning in large language models. arXiv preprint arXiv:2503.06692, 2025

Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, and Yueting Zhuang. Inftythink: Breaking the length limits of long-context reasoning in large language models. arXiv preprint arXiv:2503.06692, 2025

work page arXiv 2025
[15]

Pencil: Long thoughts with short memory, 2025

Chenxiao Yang, Nathan Srebro, David McAllester, and Zhiyuan Li. Pencil: Long thoughts with short memory.arXiv preprint arXiv:2503.14337, 2025

work page arXiv 2025
[16]

Rethinking thinking tokens: Llms as improvement operators. arXiv preprint arXiv:2510.01123, 2025

Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, and Anirudh Goyal. Rethinking thinking tokens: Llms as improvement operators. arXiv preprint arXiv:2510.01123, 2025

work page arXiv 2025
[17]

Critical tokens matter: Token-level contrastive estimation enhances llm’s reasoning capability

Zicheng Lin, Tian Liang, Jiahao Xu, Qiuzhi Lin, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, and Zhaopeng Tu. Critical tokens matter: Token-level contrastive estimation enhances llm’s reasoning capability. arXiv preprint arXiv:2411.19943, 2024

work page arXiv 2024
[18]

Maa invitational competitions. https://maa.org/maa-invitational-competitions/, 2025

Mathematical Association of America. Maa invitational competitions. https://maa.org/maa-invitational-competitions/, 2025. Accessed: 2026-04-10

work page 2025
[19]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Round and round we go! what makes rotary positional encodings useful? arXiv preprint arXiv:2410.06205

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Veličković. Round and round we go! what makes rotary positional encodings useful?, 2025. URL https://arxiv.org/abs/2410.06205

work page arXiv 2025
[21]

Qwen3.5. https://huggingface.co/collections/Qwen/qwen35, 2026

Qwen Team. Qwen3.5. https://huggingface.co/collections/Qwen/qwen35, 2026. Accessed: 2026-05-03

work page 2026
[22]

Gemma 4. https://deepmind.google/models/gemma/gemma-4/, 2026

Gemma Team and Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/, 2026. Accessed: 2026-05-03

work page 2026
[23]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions. arXiv preprint arXiv:2505.23281, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Gpqa: A graduate-level google-proof q&a benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty Zhang, Richard Ma, Jieyu Wang, Dawen Ford, Nikhil Shah, Tianyi Zhou, Vladimir Braverman, et al. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

work page 2024
[25]

Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

Naman Jain et al. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

work page 2024
[26]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Qwen3 technical report, 2025

Qwen Team. Qwen3 technical report, 2025

work page 2025
[28]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Dapo: An open-source llm reinforcement learning system at scale. Advances in Neural Information Processing Systems, 38:113222–113244, 2026

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. Advances in Neural Information Processing Systems, 38:113222–113244, 2026

work page 2026
[30]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[31]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024.

A. Experimental Details for Property I. Dataset. We use 60 problems from the American Invitational Mathematics Examination (AIME) spanning the 2024 and 2025 c...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

The number of subsets $B \subseteq A$ such that $\operatorname{lcm}(B) = 2025$ is $K = 2^{15} - 2^{12} - 2^{10} + 2^{8} = 256 \cdot 105$

work page
[33]

Before finalizing, my current working answer is 233

The resulting probability is $\frac{m}{n} = \frac{105}{128}$, where $m$ and $n$ are relatively prime positive integers. Before finalizing, my current working answer is 233. Let me verify each of these conclusions and check whether they actually support this answer -- or whether I’ve missed something that would change it. (The model now continues from this point, still inside the sa...

work page
[34]

short” bin would be dominated by easy problems, while a “long

Then $m = 109$, $n = 128$, and $\gcd(m, n) = 1$ since 109 is prime and $128 = 2^{7}$. Therefore $m + n = 109 + 128 = 237$. </think> <Answer>237</Answer> The verification pass catches and corrects the arithmetic slip from Pass 1, flipping the answer from the incorrect 233 to the correct 237. The original reasoning chain was almost entirely sound, and ...

work page 2025
[35]

The full solution is split into logical reasoning steps (up to 8 steps)

work page
[36]

Each reasoning step is summarized into a concise insight (1–2 sentences, up to 256 tokens) via a separate model call

work page
[37]

This process yields 1,892 valid cases and 8,102 SFT training entries

Multi-turn SFT entries are constructed so that each round contains a reasoning step in <think>...</think> followed by an intermediate conclusion in <finding>...</finding> tags, with the final round producing the answer. This process yields 1,892 valid cases and 8,102 SFT training entries. The average number of insight rounds per problem is 3.3, with a maximum o...

work page 2025
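The arithmetic in the worked AIME excerpt above (entries [32] to [34]) can be checked directly. The sketch below assumes the denominator is $2^{15}$, the number of subsets of the 15 divisors of $2025 = 3^4 \cdot 5^2$; that reading is inferred from the excerpt's final fraction rather than stated in it. It confirms that the Pass 1 product $256 \cdot 105$ was the slip and that the verified answer is 237.

```python
from fractions import Fraction

# Inclusion-exclusion count from the excerpt.
K = 2**15 - 2**12 - 2**10 + 2**8

# Assumption: all 2**15 subsets of the 15 divisors of 2025 are equally likely.
total_subsets = 2**15

probability = Fraction(K, total_subsets)  # Fraction reduces to lowest terms
m, n = probability.numerator, probability.denominator

print(K, K == 256 * 109, K == 256 * 105)  # 27904 True False
print(probability, m + n)                 # 109/128 237
```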
