Learning a Continue-Thinking Token for Enhanced Test-Time Scaling
Pith reviewed 2026-05-19 09:05 UTC · model grok-4.3
The pith
A learned continue-thinking token triggers more effective extended reasoning than fixed overrides like 'Wait'.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By augmenting a distilled version of DeepSeek-R1 with a single learned <|continue-thinking|> token and training only its embedding via reinforcement learning while keeping all model weights frozen, the authors show that this token elicits extended reasoning steps that improve accuracy on standard math benchmarks more than a fixed-token budget-forcing baseline. Where the fixed-token approach already raises accuracy, the learned token delivers markedly larger gains, for instance 4.2 percent absolute improvement on GSM8K versus 1.3 percent from the fixed method.
What carries the argument
The learned <|continue-thinking|> token whose embedding is optimized via reinforcement learning to prompt continued reasoning while the remainder of the model stays frozen.
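To make the "train one embedding, freeze everything else" idea concrete, here is a minimal toy sketch in pure Python. It is not the authors' training code, and the reviewed text does not say which RL algorithm they use; the sketch assumes a plain REINFORCE update, a hypothetical frozen read-out vector standing in for the model, and a made-up reward equal to the sampled action. All names (`train_single_embedding`, `frozen_readout`, `new_token`) are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_single_embedding(steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Frozen "model": a fixed read-out vector and two pre-existing embedding rows.
    # Tuples are used so that nothing in the loop can modify them.
    frozen_readout = (0.3, -0.2, 0.5)
    frozen_rows = ((0.1, 0.2, 0.3), (0.4, 0.5, 0.6))
    # The only trainable parameter: the embedding of the newly added token.
    new_token = [0.0, 0.0, 0.0]

    p_before = sigmoid(dot(frozen_readout, new_token))
    for _ in range(steps):
        p = sigmoid(dot(frozen_readout, new_token))
        action = 1 if rng.random() < p else 0  # 1 = "useful continuation" (toy)
        reward = float(action)                 # hypothetical reward signal
        # REINFORCE: d log pi(action) / d embedding = (action - p) * readout.
        for i, w in enumerate(frozen_readout):
            new_token[i] += lr * reward * (action - p) * w
    p_after = sigmoid(dot(frozen_readout, new_token))
    return p_before, p_after, new_token, frozen_readout, frozen_rows
```

Calling `train_single_embedding()` returns the probability of the rewarded action before and after training, together with the untouched frozen values, so a reader can confirm that only the new row moved.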
If this is right
- The learned token produces larger accuracy improvements than fixed-token budget forcing on benchmarks where extra reasoning already helps.
- On GSM8K the absolute gain reaches 4.2 percent over the base model compared with 1.3 percent for the fixed token.
- Test-time scaling is achieved by updating only a single token embedding, leaving all of the original model weights unchanged.
- The benefit is clearest precisely in the settings where manually forcing additional steps already yields some improvement.
Where Pith is reading between the lines
- The same single-token training approach could be tried on other base models or on non-math reasoning tasks to test transfer.
- Multiple learned continuation tokens might eventually be trained to control distinct aspects of reasoning behavior.
- If the token encodes a general 'think longer' signal, it could reduce reliance on hand-crafted prompt overrides across different inference budgets.
Load-bearing premise
The accuracy gains come from the token genuinely prompting useful extended reasoning steps rather than from side effects of the reinforcement learning procedure or from the particular reward and benchmark choices.
What would settle it
If inserting the learned token produces no measurable increase in reasoning-chain length or quality compared with the fixed-token baseline on the same problems, yet accuracy still rises, the claim that the token triggers genuinely extended reasoning would be falsified.
Original abstract
Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "</think>" with "Wait") can extend reasoning steps and improve accuracy. In this work, we explore whether a dedicated continue-thinking token can be learned to trigger extended reasoning. We augment a distilled version of DeepSeek-R1 with a single learned "<|continue-thinking|>" token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., "Wait") for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model's accuracy, our method achieves a markedly greater improvement. For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing.
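The budget-forcing mechanism described in the abstract, overriding the end-of-thinking token so the model keeps reasoning, can be sketched as a small decoding loop. This is an illustration under stated assumptions rather than the paper's implementation: `toy_next_token` is a hypothetical stand-in for a language model, and the `max_continues` and `max_tokens` limits are invented for the example. Passing `continue_token="Wait"` would correspond to the fixed-token baseline, while the default corresponds to a learned token.

```python
END_THINK = "</think>"
CONTINUE = "<|continue-thinking|>"


def budget_forced_decode(next_token, prompt, continue_token=CONTINUE,
                         max_continues=2, max_tokens=64):
    """Decode token by token, replacing early end-of-thinking tokens."""
    tokens = list(prompt)
    continues_used = 0
    while len(tokens) < max_tokens:
        tok = next_token(tokens)
        if tok == END_THINK and continues_used < max_continues:
            # Override: suppress the stop and insert the continue token instead.
            tokens.append(continue_token)
            continues_used += 1
            continue
        tokens.append(tok)
        if tok == END_THINK:
            break  # budget of overrides exhausted, let the model stop
    # If max_tokens is reached first, the trace is returned without END_THINK.
    return tokens, continues_used


def toy_next_token(tokens):
    """Hypothetical model: two 'step' tokens per reasoning burst, then stop."""
    recent_steps = 0
    for tok in reversed(tokens):
        if tok != "step":
            break
        recent_steps += 1
    return "step" if recent_steps < 2 else END_THINK
```

With the toy model, `budget_forced_decode(toy_next_token, ["question"])` produces three bursts of reasoning separated by two continue tokens, whereas `max_continues=0` reproduces ordinary decoding.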
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes augmenting a frozen distilled DeepSeek-R1 model with a single learned <|continue-thinking|> token whose embedding is trained via reinforcement learning. It reports that this token yields larger accuracy gains on math reasoning benchmarks than a fixed-token budget-forcing baseline (e.g., 4.2 % vs. 1.3 % absolute improvement on GSM8K relative to the no-budget-forcing base model).
Significance. If the gains are attributable to the learned token inducing longer, higher-quality reasoning traces rather than incidental RL effects, the approach would constitute a low-parameter, model-frozen method for improving test-time scaling. The decision to train only the embedding while freezing all other weights is a clear efficiency strength that merits explicit credit.
major comments (2)
- [Results] Results / GSM8K comparison: the headline 4.2 % vs. 1.3 % absolute gains are stated without error bars, number of runs, or any statistical test. Because the central empirical claim rests on this specific difference, the absence of variance information makes it impossible to judge whether the reported superiority is robust.
- [Method] Method / ablation design: no control is described in which an embedding is trained with the identical RL procedure and reward but is not given the continue-thinking interpretation (e.g., a neutral or randomly initialized token). Without this isolation, it remains possible that the observed improvement arises from generic reward optimization on the embedding rather than from the token learning to trigger extended reasoning.
minor comments (1)
- [Abstract] The abstract refers to “standard math benchmarks” yet only details GSM8K; listing the complete set of evaluated datasets and their corresponding gains would improve completeness.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment below.
Point-by-point responses
- Referee: [Results] Results / GSM8K comparison: the headline 4.2 % vs. 1.3 % absolute gains are stated without error bars, number of runs, or any statistical test. Because the central empirical claim rests on this specific difference, the absence of variance information makes it impossible to judge whether the reported superiority is robust.
  Authors: We agree that variance information is necessary to assess the robustness of the reported gains. In the revised manuscript, we will perform the experiments across multiple random seeds, report error bars (standard deviation), specify the number of runs, and include a statistical test to evaluate whether the 4.2% improvement is significantly larger than the 1.3% improvement from the fixed-token baseline.
  Revision: yes
- Referee: [Method] Method / ablation design: no control is described in which an embedding is trained with the identical RL procedure and reward but is not given the continue-thinking interpretation (e.g., a neutral or randomly initialized token). Without this isolation, it remains possible that the observed improvement arises from generic reward optimization on the embedding rather than from the token learning to trigger extended reasoning.
  Authors: We recognize the importance of this control to rule out generic effects of RL on the embedding. Our current approach trains the embedding specifically for its role in continuing the thinking process by overriding the end token. To address the referee's point, we will add an ablation in which we train a different embedding using the same RL procedure and reward signal but without linking it to the continue-thinking mechanism, and report the comparative results to show that the performance gains are due to the learned continue-thinking behavior.
  Revision: yes
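The variance reporting promised in the first response could look roughly like the sketch below. It is one possible analysis, not the authors' stated protocol: it assumes per-seed accuracies are available for the mean and standard deviation, and per-problem 0/1 correctness vectors for two methods evaluated on the same problems, and it uses a paired percentile bootstrap as the statistical test. The function names and the 95% interval are choices made for this example, and any inputs a reader supplies are their own; no results from the paper are reproduced here.

```python
import random
import statistics


def mean_and_std(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Observed accuracy(a) - accuracy(b) and a 95% percentile bootstrap interval.

    correct_a and correct_b are equal-length lists of 0/1 outcomes for the
    same problems, so each resample keeps the pairing between methods.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both methods must be scored on the same problems")
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[(n_resamples * 25) // 1000]
    hi = diffs[(n_resamples * 975) // 1000 - 1]
    return observed, (lo, hi)
```

If the resulting interval for "learned token minus fixed token" excludes zero, that would support the claim that the larger gain is not a seed or sampling artifact; an interval that straddles zero would leave the comparison unresolved.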
Circularity Check
No significant circularity; empirical comparisons are independent of fitted inputs
full rationale
The paper describes an empirical procedure: a single token embedding is trained via RL on a frozen distilled DeepSeek-R1 model, then accuracy is measured on GSM8K and other benchmarks against a no-budget-forcing baseline and a fixed-token baseline. These reported deltas (e.g., 4.2 % vs 1.3 %) are direct experimental outcomes, not quantities derived from equations that reduce by construction to the trained embedding or reward signal. No self-definitional steps, fitted-input-called-prediction patterns, or load-bearing self-citations appear in the method or results chain. The evaluation remains falsifiable by re-running the RL training and benchmark measurements under the stated protocol.
Axiom & Free-Parameter Ledger
free parameters (1)
- continue-thinking token embedding
axioms (1)
- Domain assumption: reinforcement learning on a single token embedding can produce a signal that extends reasoning steps in a useful way.
Reference graph
Works this paper leans on
- Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems.
- Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations.
- Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, and Li Erran Li. 2025. Deepscaler: Surpassing o1-preview with a 1.5B model by scaling RL. https://www.notion.so/Deepscaler-Surpassing-o1-preview-wit...
- Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758.
- Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. 2025. Scaling test-time compute without verification or RL is suboptimal. arXiv preprint arXiv:2502.12118.
- Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Ha...
- Guiding language model reasoning with planning tokens. arXiv preprint arXiv:2310.05707.
- Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, and 1 others. 2025. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571.
Reasoning traces from the paper (Figures 6 to 8)
Figure 6: GSM8K reasoning trace demonstrating the positive impact of the <|continue_thinking|> token. Blue indicates the original reasoning, yielding an incorrect answer of 7,938. Green shows the continuation after the special token w...
Question (fragment): how much does Lloyd make on eggs per week
- Calculate weekly egg production: 252 eggs/day * 7 days = 1,764 eggs/week
- Convert weekly eggs to dozens: 1,764 eggs / 12 = 147 dozen
- Calculate weekly revenue: 147 dozen * $2/dozen = $294
Answer: Lloyd makes \boxed{294} dollars per week from his egg sales.
Figure 7: GSM8K reasoning trace of the baseline model for the same question as in Figure 6. The final answer provided by the baseline model is incorre...
- Calculate the total number of eggs produced per day: Eggs per day = Number of chickens × Eggs per chicken per day = 27 × 252 = 6,804 eggs
- Determine the number of dozens of eggs produced per day: Dozens per day = Eggs per day / 12 = 6,804 / 12 = 567 dozen
- Calculate the daily revenue from selling eggs: Revenue per day = Dozens per day × Selling price per dozen = 567 × 2 = 1,134 dollars
- Find the weekly revenue by multiplying daily revenue by the number of days in a week: Revenue per week = Revenue per day × 7 = 1,134 × 7 = 7,938 dollars
Final Answer: \boxed{7,938 dollars}
Figure 8: GSM8K reasoning trace demonstrating that the <|continue_thinking|> token does not generate many tokens when the model is confident. Blue indicates the original reasoning. Green shows the short continuati...
- Calculate the number of shells found by team Alphas: 40% of 700 = 0.4 × 700 = 280 shells
- Subtract the shells found by Alphas from the total: 700 − 280 = 420 shells remaining
- Calculate the number of shells found by team The finders: 60% of 420 = 0.6 × 420 = 252 shells
- Subtract the shells found by The finders from the remaining shells: 420 − 252 = 168 shells
Conclusion: Team Gogetters found \boxed{168} shells.
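The arithmetic in the three traces can be checked mechanically. The short script below recomputes each chain of steps with integer arithmetic. Note one assumption: the full GSM8K question text is not reproduced on this page, so the reading that 252 is the total number of eggs per day (as in the corrected trace) and that the baseline's extra factor of 27 is the source of its error is inferred from the traces themselves. The helper names are hypothetical.

```python
def weekly_egg_revenue(eggs_per_day, price_per_dozen, days_per_week=7):
    """Weekly revenue when eggs are sold by the dozen at a fixed price."""
    eggs_per_week = eggs_per_day * days_per_week
    dozens, remainder = divmod(eggs_per_week, 12)
    if remainder:
        raise ValueError("eggs per week is not a whole number of dozens")
    return dozens * price_per_dozen


def remaining_shells(total, first_team_percent, second_team_percent):
    """Shells left after two teams each take a percentage of what remains."""
    after_first = total - total * first_team_percent // 100
    return after_first - after_first * second_team_percent // 100


# Continuation trace (Figure 6): 252 eggs per day in total.
corrected = weekly_egg_revenue(252, 2)
# Baseline trace (Figure 7): 252 multiplied by 27 before the same steps.
baseline = weekly_egg_revenue(27 * 252, 2)
# Shell-counting trace (Figure 8).
shells = remaining_shells(700, 40, 60)

print(corrected, baseline, shells)  # prints: 294 7938 168
```

Both egg computations are internally consistent, which suggests the baseline's mistake lies in how it interprets the quantities rather than in the arithmetic itself.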