pith. machine review for the scientific record.

arxiv: 2603.12529 · v2 · submitted 2026-03-13 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords early stopping · chain-of-thought · overthinking · large reasoning models · inference optimization · early exit · reasoning efficiency

The pith

Terminator trains a predictor on the positions where a reasoning model first emits its final answer, then halts chain-of-thought generation at the predicted point.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often keep producing chain-of-thought steps long after they have already arrived at the correct answer. The paper constructs training data by recording the exact token position where each model first emits its final answer across many traces. A separate model is then trained to predict these positions from the ongoing generation, allowing inference to halt at that point. Across four datasets the method shortens outputs by 14 to 55 percent on average while preserving accuracy and more than halving latency relative to full generation.
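To make the labeling step concrete, below is a minimal Python sketch of the first-answer idea. The paper curates labels with an LRM-driven extract-locate-verify loop (Figure 4); this sketch substitutes a plain token-level match, and the helper names and per-token 0/1 scheme are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the first-answer labeling idea; helper names and the
# per-token 0/1 scheme are illustrative assumptions, not the paper's code.

def first_answer_position(cot_tokens: list[str], answer_tokens: list[str]) -> int | None:
    """Index where the final answer first appears in the CoT, else None."""
    n, m = len(cot_tokens), len(answer_tokens)
    for i in range(n - m + 1):
        if cot_tokens[i:i + m] == answer_tokens:
            return i
    return None

def make_labels(cot_tokens: list[str], answer_tokens: list[str]) -> list[int] | None:
    """Label each CoT position 0 (keep thinking) or 1 (safe to exit),
    switching to 1 at the first occurrence of the final answer."""
    pos = first_answer_position(cot_tokens, answer_tokens)
    if pos is None:
        return None  # answer never appears early in this trace; skip it
    return [0] * pos + [1] * (len(cot_tokens) - pos)

# Toy usage: the answer "42" first appears five tokens in.
print(make_labels(["so", "x", "equals", "6*7", "42", "check", "again", "42"], ["42"]))
# -> [0, 0, 0, 0, 1, 1, 1, 1]
```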

Core claim

Terminator is an early-exit strategy that treats the first arrival of the final answer token as a proxy for the optimal stopping point, builds a dataset of these positions from existing chain-of-thought traces, and trains a lightweight predictor to forecast the same positions on new inputs, thereby shortening reasoning sequences with virtually no accuracy drop on MATH-500, AIME 2025, HumanEval, and GPQA.

What carries the argument

A predictor trained on first-answer positions extracted from chain-of-thought traces to forecast optimal exit points at inference time.

If this is right

  • CoT lengths drop 14–55 percent on average on MATH-500, AIME 2025, HumanEval, and GPQA while accuracy is essentially unchanged (the length metric is made concrete in the sketch after this list).
  • Inference latency falls by more than a factor of two compared with standard full-generation runs.
  • The approach outperforms prior state-of-the-art early-stopping baselines on the same four datasets.
  • Optimal exit points can be learned in a task- and model-specific manner without hand-crafted heuristics.
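As a concrete reading of the first two bullets, here is a hedged sketch of the length bookkeeping; the token counts are invented, and averaging the per-sample ratio is an assumption about how such reductions are usually reported, not the paper's evaluation code.

```python
# Hedged sketch of the headline length metric; numbers are invented.
# Compression rate = tokens actually generated / tokens of the full CoT,
# so lower is better (more tokens saved), matching Figure 27's reading.

def compression_rate(exit_tokens: int, full_tokens: int) -> float:
    return exit_tokens / full_tokens

# (early-exit tokens, full-generation tokens) per sample, invented:
samples = [(900, 1200), (2300, 3100), (600, 900)]
rates = [compression_rate(e, f) for e, f in samples]
mean_reduction = 1.0 - sum(rates) / len(rates)
print(f"average CoT length reduction: {mean_reduction:.0%}")  # cf. the 14-55% range
```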

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same first-answer labeling trick could be reused to create early-exit datasets for other autoregressive tasks such as code synthesis or multi-step planning.
  • If the predictor transfers across model families, it would allow a single stopping model to accelerate many different reasoning systems.
  • One could test whether the learned exit points correlate with human judgments of when a reasoning trace has become redundant.

Load-bearing premise

The first token position at which the model emits its final answer is a reliable proxy for the optimal stopping point, and a predictor trained on these positions will generalize to unseen problems and different models without harming accuracy.

What would settle it

Running the trained predictor on a held-out test set and observing a statistically significant drop in final answer accuracy relative to full chain-of-thought generation would falsify the central claim.
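A minimal sketch of that settling experiment, assuming paired exact-match outcomes per held-out problem; the outcome vectors are invented and the exact sign test over discordant pairs is one reasonable choice of statistic, not the paper's protocol.

```python
# Sketch of the settling experiment: paired exact-match outcomes for
# full-CoT generation vs. early exit on a held-out set; data invented.
from scipy.stats import binomtest

full_correct = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # full chain-of-thought
exit_correct = [1, 0, 0, 1, 1, 1, 1, 1, 1, 0]  # exit at predicted position

# Discordant pairs are the only informative ones for a paired comparison.
full_only = sum(f and not e for f, e in zip(full_correct, exit_correct))
exit_only = sum(e and not f for f, e in zip(full_correct, exit_correct))

# Under the null of no accuracy change, flips are 50/50 in each direction.
result = binomtest(full_only, full_only + exit_only, p=0.5)
print(f"full-only: {full_only}, exit-only: {exit_only}, p = {result.pvalue:.3f}")
# A significant excess of full-only wins would falsify the central claim.
```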

Figures

Figures reproduced from arXiv: 2603.12529 by Alliot Nagle, Ashok Vardhan Makkuva, Dhia Garbaya, Hyeji Kim, Jakhongir Saydaliev, Michael Gastpar.

Figure 1. Early stopping via TERMINATOR. TERMINATOR is a binary probe classifier that predicts whether to exit or not at every CoT token. Once the majority of prediction bits within a window (10 here) are 1, </think> is injected into the LRM's token stream to stop thinking (Sec. 4).

Figure 2. Event-Locked Averaging of Token-Confidence. Event-locked averaging shows consistent agreement on spiking behavior at the answer position in each CoT, but disagrees elsewhere; the phenomenon is not readily observable in the single-sample case. The left panels show the Token-Confidence (Fu et al., 2025b) and log-probability trajectories throughout reasoning for a single, randomly selected sample.

Figure 3. Token Usage Frequency Shift. "Thinking token" usage changes depending on whether the final answer has been generated in the CoT. Rates are computed by counting the raw occurrences of each token before and after the answer, then normalizing each count by the number of tokens in the respective bin. The arrival of the final answer is hinted at by changes in the rates for these tokens.

Figure 4. Training-Dataset Curation Process. An LRM is used to (1) extract the final answer â from the final solution s, (2) identify the earliest position of â in the CoT r, and (3) verify that the position is correct. If it is, the exact position of â is extracted from the CoT at the final token-index extraction step; otherwise, the identification step is retried with feedback.

Figure 5. OOD Performance of TERMINATOR. The best trade-off between accuracy and compression rate is achieved when the evaluation set is in-distribution with the training dataset. Shown: the out-of-distribution performance of TERMINATOR with respect to compression rate (left) and accuracy (right) for Qwen3-8B, with training datasets along the rows and evaluation sets along the columns.

Figure 7. TERMINATOR Recovers Event-Locked Average Spiking. The exit positions predicted by TERMINATOR (center) recover the same spiking behavior in the event-locked averaged Token-Confidence as the ground-truth answer positions (left). The histogram of differences between the exit positions (right) shows that TERMINATOR's predicted exit positions are close to the ground truth.

Figure 8. TERMINATOR Token Usage Biases. The exit positions predicted by TERMINATOR recover the same biases in the "thinking token" occurrence rates as the ground-truth answer positions. The inset axes on each panel show the percentage of dots that lie above the diagonal when the ground-truth and TERMINATOR answer positions are used.

Figure 9. Pareto Frontier. TERMINATOR consistently achieves strong Pareto efficiency, reflected by the best accuracy–efficiency tradeoff across models and tasks. Pareto frontiers of accuracy versus compression rate are plotted across four reasoning models (Qwen3-8B, Qwen3-14B, Ministral-8B, Ministral-14B) and four benchmarks (MATH-500, AIME25, HumanEval, GPQA); each point is one method's accuracy and compression rate.

Figure 10. Event-Locked Averaging of Token-Confidence for AIME (1983–2024).

Figure 11. Event-Locked Averaging of Token-Confidence for MATH.

Figure 12. Event-Locked Averaging of Token-Confidence for OpenCoder-SFT.

Figure 13. Event-Locked Averaging of Token-Confidence for OpenScience.

Figure 14. Token Usage Frequency Shift. An extension of the results shown in Figure 3.

Figure 15. Token Usage Frequency Shift for AIME (1983–2024).

Figure 16. Token Usage Frequency Shift for MATH.

Figure 17. Token Usage Frequency Shift for OpenCoder-SFT.

Figure 18. Token Usage Frequency Shift for OpenScience.

Figure 19. Token Occurrence Rate vs CoT Length. For some tokens, such as these three "thinking tokens," occurrence rates decrease rapidly as CoT length increases. Interestingly, the side of the answer (either "before" or "after") with the highest rate is the one that decays the most (the "before" rate for hmm and okay and the "after" rate for another), while the side with the lower rate sees only a slight decrease.

Figure 20. Predicted Probabilities Event-Locked Averaging. The dashed vertical line shows where TERMINATOR terminates the CoT with a sliding window of 10 and an exit threshold of 0.7, indicated by the horizontal dotted line. The average predicted probability stream is shown across all test problems from MATH-500, AIME25, HumanEval, and GPQA. Figures 21 to 24 show prediction streams from individual, randomly drawn samples.

Figure 21. Predicted Probabilities for AIME25. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from AIME25.

Figure 22. Predicted Probabilities for MATH-500. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from MATH-500.

Figure 23. Predicted Probabilities for HumanEval. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from HumanEval.

Figure 24. Predicted Probabilities for GPQA. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from GPQA.

Figure 25. Changing the Predictive Threshold. The threshold for predicting 0 or 1 is varied from τ = 0.1 to τ = 0.9. The compression rate exhibits the largest change across this range for all datasets, compared to the change in accuracy; some datasets, like MATH-500 and HumanEval, exhibit very little performance drop.

Figure 26. First Answer Occurrence Histogram. A histogram of the first occurrence of the final answer for each data source used in the training dataset.

Figure 27. Compression Ratio Histograms for TERMINATOR. Each histogram shows the frequency of an achieved compression rate when early-exiting with TERMINATOR; a lower compression rate is better, as it means more tokens saved. Three CoTs are sampled per data source.

Figure 28. Predicted Probabilities for MATH-500. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from MATH-500. The beginning and the end are truncated for better visibility.

Figure 29. Predicted Probabilities for MATH-500. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from MATH-500. The beginning and the end are truncated for better visibility.

Figure 30. Predicted Probabilities for AIME25. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from AIME 2025. The beginning and the end are truncated for better visibility.

Figure 31. Predicted Probabilities for AIME25. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from AIME 2025. The beginning and the end are truncated for better visibility.
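Figures 1, 20, and 25 jointly pin down the runtime mechanism: a per-token probe probability, a threshold τ, and a majority vote over a sliding window of 10. A minimal sketch of that exit rule, with the probe replaced by a toy probability stream:

```python
# Minimal sketch of the windowed exit rule (window 10, threshold 0.7 per
# Figures 1, 20, and 25); the probe is a stand-in, and the injection of
# </think> into the token stream is left as a comment.
from collections import deque

def exit_index(prob_stream, window: int = 10, tau: float = 0.7):
    """Return the token index at which to stop thinking, or None."""
    bits = deque(maxlen=window)
    for t, p in enumerate(prob_stream):
        bits.append(1 if p >= tau else 0)  # per-token exit bit
        if len(bits) == window and sum(bits) > window // 2:
            return t  # inject </think> here and let the LRM answer
    return None

# Toy stream that ramps up near where the answer would first appear:
stream = [0.05] * 50 + [0.4] * 5 + [0.9] * 15
print(exit_index(stream))  # -> 60, a few tokens after the ramp
```

The majority vote over a window smooths out isolated spikes in the probe stream, so a single confident token cannot trigger a premature exit.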
Original abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design Terminator, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train Terminator. Powered by this approach, Terminator achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, while outperforming current state-of-the-art methods and reducing inference latency by more than 2x compared to the original LRM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Terminator, an early-exit strategy for Large Reasoning Models performing Chain-of-Thought reasoning. It constructs training labels from the first token position at which the model emits a final answer (e.g., in boxed format), trains a separate predictor on these positions as proxies for optimal stopping points, and reports 14-55% average CoT length reductions on MATH-500, AIME 2025, HumanEval, and GPQA while outperforming prior methods and cutting latency by more than 2x with virtually no accuracy change.

Significance. If the central claims hold after verification, the work would offer a practical, model-agnostic way to reduce overthinking and inference cost in LRMs on hard reasoning tasks, with clear deployment value for latency-sensitive applications.

major comments (2)
  1. [Abstract] The headline guarantee of 'virtually no change in performance' is load-bearing for the contribution, yet the manuscript provides no information on how accuracy equivalence was measured across datasets, whether error bars or statistical tests were used, or how the training set was constructed to avoid leakage from the same LRM whose outputs supply the labels.
  2. [Method] Central method (first-answer-position labeling): treating the first emission of a final answer as the optimal exit label requires explicit empirical verification on held-out prompts that tokens generated after this point never improve correctness; absent such a check, any systematic mismatch between the proxy and true optimality directly undermines the no-accuracy-loss claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] The headline guarantee of 'virtually no change in performance' is load-bearing for the contribution, yet the manuscript provides no information on how accuracy equivalence was measured across datasets, whether error bars or statistical tests were used, or how the training set was constructed to avoid leakage from the same LRM whose outputs supply the labels.

    Authors: We thank the referee for this observation. Accuracy equivalence was evaluated via exact-match on the final answer (e.g., boxed content) between full CoT and early-exit outputs on identical test prompts. In the revised manuscript we will expand the abstract and add an explicit evaluation subsection that reports the precise metric, includes error bars computed over 3 independent runs with different random seeds, and describes the training-set construction: labels were generated on a disjoint collection of prompts (distinct from all evaluation sets) to prevent leakage from the same LRM. revision: yes

  2. Referee: [Method] Central method (first-answer-position labeling): treating the first emission of a final answer as the optimal exit label requires explicit empirical verification on held-out prompts that tokens generated after this point never improve correctness; absent such a check, any systematic mismatch between the proxy and true optimality directly undermines the no-accuracy-loss claim.

    Authors: We agree that a direct empirical check would strengthen the optimality assumption. While our reported results already demonstrate that early exits at the predicted positions preserve accuracy, the manuscript does not contain an explicit ablation confirming that post-first-answer tokens never alter correctness on held-out prompts. We will add this verification in the revised version by continuing generation after the first-answer position on a held-out subset and measuring any change in final-answer correctness; the results and any necessary discussion of the proxy's limitations will be included. revision: yes
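A sketch of the promised check, assuming hypothetical generate_full / answer_at / answer_final helpers in place of the authors' pipeline; it counts held-out problems where tokens generated after the first answer change correctness in either direction.

```python
# Sketch of the promised verification; generate_full, answer_at, and
# answer_final are hypothetical stand-ins for the authors' pipeline.

def continuation_flips(labeled_prompts, generate_full, answer_at, answer_final):
    """Count held-out problems where tokens generated after the first
    answer change final-answer correctness in either direction."""
    fixed = broken = 0
    for prompt, gold in labeled_prompts:
        trace = generate_full(prompt)  # full CoT, no early exit
        early = answer_at(trace)       # answer as first emitted in the CoT
        final = answer_final(trace)    # answer after all remaining thinking
        if early != gold and final == gold:
            fixed += 1   # extra thinking repaired a wrong early answer
        elif early == gold and final != gold:
            broken += 1  # extra thinking overturned a correct answer
    return fixed, broken

# A nonzero `fixed` count bounds the accuracy the first-answer proxy
# gives up; `broken` measures overthinking that actively hurts.
```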

Circularity Check

0 steps flagged

No significant circularity: labels are observable first-answer positions; performance claims rest on empirical evaluation

full rationale

The paper constructs a dataset of reasoning lengths directly from the observable first token positions at which the LRM emits a final answer during full generation. Terminator is trained as a standard supervised predictor on these labels. The headline performance claims (length reduction with virtually no accuracy change) are supported by direct empirical measurements on held-out examples from MATH-500, AIME 2025, HumanEval, and GPQA rather than by any definitional equivalence or self-referential reduction. No equation or step equates the learned exit prediction to its own training labels by construction, and no load-bearing self-citation or imported uniqueness theorem is invoked. The claims are therefore grounded in external benchmarks rather than in the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach implicitly assumes that first-answer positions are stable across runs and that a small auxiliary model can learn them, though no regularization details are given.

pith-pipeline@v0.9.0 · 5560 in / 1137 out tokens · 60951 ms · 2026-05-15T11:09:51.842018+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 3 internal anchors

  1. Cheng, J. and Van Durme, B. Compressed Chain of Thought: Efficient Reasoning through Dense Representations, 2024. arXiv:2412.13171.

  2. Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting Language Models to Compress Contexts. EMNLP 2023. https://aclanthology.org/2023.emnlp-main.232/

  3. Ding, B., Chen, Y., Wang, F., Ming, L., and Lin, T. Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model, 2025. arXiv:2506.23840.

  4. Fu, Y., Chen, J., Zhu, S., Fu, Z., Dai, Z., Zhuang, Y., Ma, Y., Qiao, A., Rosing, T., Stoica, I., and Zhang, H. Efficiently…, 2025.

  5. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. …

  6. Jung, H. and Kim, K.-J. Discrete Prompt Compression with Reinforcement Learning. IEEE Access, 12:72578–72587, 2024. doi:10.1109/ACCESS.2024.3403426.

  7. Kang, Y., Sun, X., Chen, L., and Zou, W. C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23):24312–24320, 2025. doi:10.1609/aaai.v39i23.34608.

  8. Liu, A. H., Khandelwal, K., Subramanian, S., Jouault, V., Rastogi, A., Sadé, A., Jeffares, A., Jiang, A., et al. …

  9. Mao, M., Yin, B., Zhu, Y., and Fang, X. Early Stopping Chain-of-Thoughts in Large Language Models, 2025. arXiv:2509.14004.

  10. Mu, J., Li, X. L., and Goodman, N. Learning to Compress Prompts with Gist Tokens. NeurIPS 2023.

  11. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple Test-Time Scaling, 2025. arXiv:2501.19393.

  12. Nagle, A., Girish, A., Bondaschi, M., Gastpar, M., Makkuva, A. V., and Kim, H. Fundamental Limits o…

  13. Shrivastava, V., Awadallah, A., Balachandran, V., Garg, S., Behl, H., and Papailiopoulos, D. Sample More to Think Less: Group Filtered Policy…

  14. Training Large Language Models to Reason in a Continuous Latent Space. https://openreview.net/forum?id=uREj4ZuGJE

  15. Zhang, J., Zhu, Y., Sun, M., Luo, Y., Qiao, S., Du, L., Zheng, D., Chen, H., and Zhang, N. LightThinker: Thinking Step-by-Step Compression. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.184/

  16. Zhang, Z., He, X., Yan, W., Shen, A., Zhao, C., Wang, S., Shen, Y., and Wang, X. E. Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space, 2025. arXiv:2505…