pith. machine review for the scientific record.

arxiv: 2603.12529 · v2 · submitted 2026-03-13 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords early stopping · chain-of-thought · overthinking · large reasoning models · inference optimization · early exit · reasoning efficiency

The pith

Terminator trains a predictor on the positions where a reasoning model first emits its final answer, then halts chain-of-thought generation at the predicted point.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often keep producing chain-of-thought steps long after they have already arrived at the correct answer. The paper constructs training data by recording the exact token position where each model first emits its final answer across many traces. A separate model is then trained to predict these positions from the ongoing generation, allowing inference to halt at that point. Across four datasets the method shortens outputs by 14 to 55 percent on average while preserving accuracy and more than halving latency relative to full generation.
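To make the labeling step concrete, below is a minimal Python sketch of the first-answer idea. The paper curates labels with an LRM-driven extract-locate-verify loop (Figure 4); this sketch substitutes a plain token-level match, and the helper names and per-token 0/1 scheme are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the first-answer labeling idea; helper names and the
# per-token 0/1 scheme are illustrative assumptions, not the paper's code.

def first_answer_position(cot_tokens: list[str], answer_tokens: list[str]) -> int | None:
    """Index where the final answer first appears in the CoT, else None."""
    n, m = len(cot_tokens), len(answer_tokens)
    for i in range(n - m + 1):
        if cot_tokens[i:i + m] == answer_tokens:
            return i
    return None

def make_labels(cot_tokens: list[str], answer_tokens: list[str]) -> list[int] | None:
    """Label each CoT position 0 (keep thinking) or 1 (safe to exit),
    switching to 1 at the first occurrence of the final answer."""
    pos = first_answer_position(cot_tokens, answer_tokens)
    if pos is None:
        return None  # answer never appears early in this trace; skip it
    return [0] * pos + [1] * (len(cot_tokens) - pos)

# Toy usage: the answer "42" first appears five tokens in.
print(make_labels(["so", "x", "equals", "6*7", "42", "check", "again", "42"], ["42"]))
# -> [0, 0, 0, 0, 1, 1, 1, 1]
```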

Core claim

Terminator is an early-exit strategy that treats the first arrival of the final answer token as a proxy for the optimal stopping point, builds a dataset of these positions from existing chain-of-thought traces, and trains a lightweight predictor to forecast the same positions on new inputs, thereby shortening reasoning sequences with virtually no accuracy drop on MATH-500, AIME 2025, HumanEval, and GPQA.

What carries the argument

A predictor trained on first-answer positions extracted from chain-of-thought traces to forecast optimal exit points at inference time.

If this is right

  • CoT lengths drop 14–55 percent on average on MATH-500, AIME 2025, HumanEval, and GPQA while accuracy is essentially unchanged (the length metric is made concrete in the sketch after this list).
  • Inference latency falls by more than a factor of two compared with standard full-generation runs.
  • The approach outperforms prior state-of-the-art early-stopping baselines on the same four datasets.
  • Optimal exit points can be learned in a task- and model-specific manner without hand-crafted heuristics.
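As a concrete reading of the first two bullets, here is a hedged sketch of the length bookkeeping; the token counts are invented, and averaging the per-sample ratio is an assumption about how such reductions are usually reported, not the paper's evaluation code.

```python
# Hedged sketch of the headline length metric; numbers are invented.
# Compression rate = tokens actually generated / tokens of the full CoT,
# so lower is better (more tokens saved), matching Figure 27's reading.

def compression_rate(exit_tokens: int, full_tokens: int) -> float:
    return exit_tokens / full_tokens

# (early-exit tokens, full-generation tokens) per sample, invented:
samples = [(900, 1200), (2300, 3100), (600, 900)]
rates = [compression_rate(e, f) for e, f in samples]
mean_reduction = 1.0 - sum(rates) / len(rates)
print(f"average CoT length reduction: {mean_reduction:.0%}")  # cf. the 14-55% range
```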

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same first-answer labeling trick could be reused to create early-exit datasets for other autoregressive tasks such as code synthesis or multi-step planning.
  • If the predictor transfers across model families, it would allow a single stopping model to accelerate many different reasoning systems.
  • One could test whether the learned exit points correlate with human judgments of when a reasoning trace has become redundant.

Load-bearing premise

The first token position at which the model emits its final answer is a reliable proxy for the optimal stopping point, and a predictor trained on these positions will generalize to unseen problems and different models without harming accuracy.

What would settle it

Running the trained predictor on a held-out test set and observing a statistically significant drop in final answer accuracy relative to full chain-of-thought generation would falsify the central claim.
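A minimal sketch of that settling experiment, assuming paired exact-match outcomes per held-out problem; the outcome vectors are invented and the exact sign test over discordant pairs is one reasonable choice of statistic, not the paper's protocol.

```python
# Sketch of the settling experiment: paired exact-match outcomes for
# full-CoT generation vs. early exit on a held-out set; data invented.
from scipy.stats import binomtest

full_correct = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # full chain-of-thought
exit_correct = [1, 0, 0, 1, 1, 1, 1, 1, 1, 0]  # exit at predicted position

# Discordant pairs are the only informative ones for a paired comparison.
full_only = sum(f and not e for f, e in zip(full_correct, exit_correct))
exit_only = sum(e and not f for f, e in zip(full_correct, exit_correct))

# Under the null of no accuracy change, flips are 50/50 in each direction.
result = binomtest(full_only, full_only + exit_only, p=0.5)
print(f"full-only: {full_only}, exit-only: {exit_only}, p = {result.pvalue:.3f}")
# A significant excess of full-only wins would falsify the central claim.
```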

Figures

Figures reproduced from arXiv: 2603.12529 by Alliot Nagle, Ashok Vardhan Makkuva, Dhia Garbaya, Hyeji Kim, Jakhongir Saydaliev, Michael Gastpar.

Figure 1. Early stopping via TERMINATOR. TERMINATOR is a binary probe classifier that predicts whether to exit or not at every CoT token. Once the majority of prediction bits within a window (10 here) are 1, </think> is injected into the LRM's token stream to stop thinking (Sec. 4).

Figure 2. Event-Locked Averaging of Token-Confidence. Event-locked averaging shows consistent agreement on spiking behavior at the answer position in each CoT, but disagrees elsewhere; the phenomenon is not readily observable in the single-sample case. The left panels show the Token-Confidence (Fu et al., 2025b) and log-probability trajectories throughout reasoning for a single, randomly selected sample.

Figure 3. Token Usage Frequency Shift. "Thinking token" usage changes depending on whether the final answer has been generated in the CoT. Rates are computed by counting the raw occurrences of each token before and after the answer, then normalizing each count by the number of tokens in the respective bin. The arrival of the final answer is hinted at by changes in the rates for these tokens.

Figure 4. Training-Dataset Curation Process. An LRM is used to (1) extract the final answer â from the final solution s, (2) identify the earliest position of â in the CoT r, and (3) verify that the position is correct. If it is, the exact position of â is extracted from the CoT at the final token-index extraction step; otherwise, the identification step is retried with feedback.

Figure 5. OOD Performance of TERMINATOR. The best trade-off between accuracy and compression rate is achieved when the evaluation set is in-distribution with the training dataset. Shown: the out-of-distribution performance of TERMINATOR with respect to compression rate (left) and accuracy (right) for Qwen3-8B, with training datasets along the rows and evaluation sets along the columns.

Figure 7. TERMINATOR Recovers Event-Locked Average Spiking. The exit positions predicted by TERMINATOR (center) recover the same spiking behavior in the event-locked averaged Token-Confidence as the ground-truth answer positions (left). The histogram of differences between the exit positions (right) shows that TERMINATOR's predicted exit positions are close to the ground truth.

Figure 8. TERMINATOR Token Usage Biases. The exit positions predicted by TERMINATOR recover the same biases in the "thinking token" occurrence rates as the ground-truth answer positions. The inset axes on each panel show the percentage of dots that lie above the diagonal when the ground-truth and TERMINATOR answer positions are used.

Figure 9. Pareto Frontier. TERMINATOR consistently achieves strong Pareto efficiency, reflected by the best accuracy–efficiency tradeoff across models and tasks. Pareto frontiers of accuracy versus compression rate are plotted across four reasoning models (Qwen3-8B, Qwen3-14B, Ministral-8B, Ministral-14B) and four benchmarks (MATH-500, AIME25, HumanEval, GPQA); each point is one method's accuracy and compression rate.

Figure 10. Event-Locked Averaging of Token-Confidence for AIME (1983–2024).

Figure 11. Event-Locked Averaging of Token-Confidence for MATH.

Figure 12. Event-Locked Averaging of Token-Confidence for OpenCoder-SFT.

Figure 13. Event-Locked Averaging of Token-Confidence for OpenScience.

Figure 14. Token Usage Frequency Shift. An extension of the results shown in Figure 3.

Figure 15. Token Usage Frequency Shift for AIME (1983–2024).

Figure 16. Token Usage Frequency Shift for MATH.

Figure 17. Token Usage Frequency Shift for OpenCoder-SFT.

Figure 18. Token Usage Frequency Shift for OpenScience.

Figure 19. Token Occurrence Rate vs CoT Length. For some tokens, such as these three "thinking tokens," occurrence rates decrease rapidly as CoT length increases. Interestingly, the side of the answer (either "before" or "after") with the highest rate is the one that decays the most (the "before" rate for hmm and okay and the "after" rate for another), while the side with the lower rate sees only a slight decrease.

Figure 20. Predicted Probabilities Event-Locked Averaging. The dashed vertical line shows where TERMINATOR terminates the CoT with a sliding window of 10 and an exit threshold of 0.7, indicated by the horizontal dotted line. The average predicted probability stream is shown across all test problems from MATH-500, AIME25, HumanEval, and GPQA. Figures 21 to 24 show prediction streams from individual, randomly drawn samples.

Figure 21. Predicted Probabilities for AIME25. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from AIME25.

Figure 22. Predicted Probabilities for MATH-500. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from MATH-500.

Figure 23. Predicted Probabilities for HumanEval. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from HumanEval.

Figure 24. Predicted Probabilities for GPQA. TERMINATOR's predicted probability stream for early-exiting on four randomly chosen samples from GPQA.

Figure 25. Changing the Predictive Threshold. The threshold for predicting 0 or 1 is varied from τ = 0.1 to τ = 0.9. The compression rate exhibits the largest change across this range for all datasets, compared to the change in accuracy; some datasets, like MATH-500 and HumanEval, exhibit very little performance drop.

Figure 26. First Answer Occurrence Histogram. A histogram of the first occurrence of the final answer for each data source used in the training dataset.

Figure 27. Compression Ratio Histograms for TERMINATOR. Each histogram shows the frequency of an achieved compression rate when early-exiting with TERMINATOR; a lower compression rate is better, as it means more tokens saved. Three CoTs are sampled per data source.

Figure 28. Predicted Probabilities for MATH-500. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from MATH-500. The beginning and the end are truncated for better visibility.

Figure 29. Predicted Probabilities for MATH-500. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from MATH-500. The beginning and the end are truncated for better visibility.

Figure 30. Predicted Probabilities for AIME25. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from AIME 2025. The beginning and the end are truncated for better visibility.

Figure 31. Predicted Probabilities for AIME25. TERMINATOR's predicted probabilities for early exit on a randomly chosen sample from AIME 2025. The beginning and the end are truncated for better visibility.
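Figures 1, 20, and 25 jointly pin down the runtime mechanism: a per-token probe probability, a threshold τ, and a majority vote over a sliding window of 10. A minimal sketch of that exit rule, with the probe replaced by a toy probability stream:

```python
# Minimal sketch of the windowed exit rule (window 10, threshold 0.7 per
# Figures 1, 20, and 25); the probe is a stand-in, and the injection of
# </think> into the token stream is left as a comment.
from collections import deque

def exit_index(prob_stream, window: int = 10, tau: float = 0.7):
    """Return the token index at which to stop thinking, or None."""
    bits = deque(maxlen=window)
    for t, p in enumerate(prob_stream):
        bits.append(1 if p >= tau else 0)  # per-token exit bit
        if len(bits) == window and sum(bits) > window // 2:
            return t  # inject </think> here and let the LRM answer
    return None

# Toy stream that ramps up near where the answer would first appear:
stream = [0.05] * 50 + [0.4] * 5 + [0.9] * 15
print(exit_index(stream))  # -> 60, a few tokens after the ramp
```

The majority vote over a window smooths out isolated spikes in the probe stream, so a single confident token cannot trigger a premature exit.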
Original abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design Terminator, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train Terminator. Powered by this approach, Terminator achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, while outperforming current state-of-the-art methods and reducing inference latency by more than 2x compared to the original LRM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Terminator, an early-exit strategy for Large Reasoning Models performing Chain-of-Thought reasoning. It constructs training labels from the first token position at which the model emits a final answer (e.g., in boxed format), trains a separate predictor on these positions as proxies for optimal stopping points, and reports 14-55% average CoT length reductions on MATH-500, AIME 2025, HumanEval, and GPQA while outperforming prior methods and cutting latency by more than 2x with virtually no accuracy change.

Significance. If the central claims hold after verification, the work would offer a practical, model-agnostic way to reduce overthinking and inference cost in LRMs on hard reasoning tasks, with clear deployment value for latency-sensitive applications.

major comments (2)
  1. [Abstract] The headline guarantee of 'virtually no change in performance' is load-bearing for the contribution, yet the manuscript provides no information on how accuracy equivalence was measured across datasets, whether error bars or statistical tests were used, or how the training set was constructed to avoid leakage from the same LRM whose outputs supply the labels.
  2. [Method] Central method (first-answer-position labeling): treating the first emission of a final answer as the optimal exit label requires explicit empirical verification on held-out prompts that tokens generated after this point never improve correctness; absent such a check, any systematic mismatch between the proxy and true optimality directly undermines the no-accuracy-loss claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] The headline guarantee of 'virtually no change in performance' is load-bearing for the contribution, yet the manuscript provides no information on how accuracy equivalence was measured across datasets, whether error bars or statistical tests were used, or how the training set was constructed to avoid leakage from the same LRM whose outputs supply the labels.

    Authors: We thank the referee for this observation. Accuracy equivalence was evaluated via exact-match on the final answer (e.g., boxed content) between full CoT and early-exit outputs on identical test prompts. In the revised manuscript we will expand the abstract and add an explicit evaluation subsection that reports the precise metric, includes error bars computed over 3 independent runs with different random seeds, and describes the training-set construction: labels were generated on a disjoint collection of prompts (distinct from all evaluation sets) to prevent leakage from the same LRM. revision: yes

  2. Referee: [Method] Central method (first-answer-position labeling): treating the first emission of a final answer as the optimal exit label requires explicit empirical verification on held-out prompts that tokens generated after this point never improve correctness; absent such a check, any systematic mismatch between the proxy and true optimality directly undermines the no-accuracy-loss claim.

    Authors: We agree that a direct empirical check would strengthen the optimality assumption. While our reported results already demonstrate that early exits at the predicted positions preserve accuracy, the manuscript does not contain an explicit ablation confirming that post-first-answer tokens never alter correctness on held-out prompts. We will add this verification in the revised version by continuing generation after the first-answer position on a held-out subset and measuring any change in final-answer correctness; the results and any necessary discussion of the proxy's limitations will be included. revision: yes
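A sketch of the promised check, assuming hypothetical generate_full / answer_at / answer_final helpers in place of the authors' pipeline; it counts held-out problems where tokens generated after the first answer change correctness in either direction.

```python
# Sketch of the promised verification; generate_full, answer_at, and
# answer_final are hypothetical stand-ins for the authors' pipeline.

def continuation_flips(labeled_prompts, generate_full, answer_at, answer_final):
    """Count held-out problems where tokens generated after the first
    answer change final-answer correctness in either direction."""
    fixed = broken = 0
    for prompt, gold in labeled_prompts:
        trace = generate_full(prompt)  # full CoT, no early exit
        early = answer_at(trace)       # answer as first emitted in the CoT
        final = answer_final(trace)    # answer after all remaining thinking
        if early != gold and final == gold:
            fixed += 1   # extra thinking repaired a wrong early answer
        elif early == gold and final != gold:
            broken += 1  # extra thinking overturned a correct answer
    return fixed, broken

# A nonzero `fixed` count bounds the accuracy the first-answer proxy
# gives up; `broken` measures overthinking that actively hurts.
```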

Circularity Check

0 steps flagged

No significant circularity: labels are observable first-answer positions; performance claims rest on empirical evaluation

full rationale

The paper constructs a dataset of reasoning lengths directly from the observable first token positions at which the LRM emits a final answer during full generation. Terminator is trained as a standard supervised predictor on these labels. The headline performance claims (length reduction with virtually no accuracy change) are supported by direct empirical measurements on held-out examples from MATH-500, AIME 2025, HumanEval, and GPQA rather than by any definitional equivalence or self-referential reduction. No equation or step equates the learned exit prediction to its own training labels by construction, and no load-bearing self-citation or imported uniqueness theorem is invoked. The claims are therefore grounded in external benchmarks rather than in the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the approach implicitly assumes that first-answer positions are stable across runs and that a small auxiliary model can learn them, though no regularization details are given.

pith-pipeline@v0.9.0 · 5560 in / 1137 out tokens · 60951 ms · 2026-05-15T11:09:51.842018+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 3 internal anchors

  1. Cheng, J. and Van Durme, B. Compressed Chain of Thought: Efficient Reasoning through Dense Representations, 2024. arXiv:2412.13171.

  2. Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting Language Models to Compress Contexts. EMNLP 2023. https://aclanthology.org/2023.emnlp-main.232/

  3. Ding, B., Chen, Y., Wang, F., Ming, L., and Lin, T. Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model, 2025. arXiv:2506.23840.

  4. Fu, Y., Chen, J., Zhu, S., Fu, Z., Dai, Z., Zhuang, Y., Ma, Y., Qiao, A., Rosing, T., Stoica, I., and Zhang, H. Efficiently…, 2025.

  5. Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. …

  6. Jung, H. and Kim, K.-J. Discrete Prompt Compression with Reinforcement Learning. IEEE Access, 12:72578–72587, 2024. doi:10.1109/ACCESS.2024.3403426.

  7. Kang, Y., Sun, X., Chen, L., and Zou, W. C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23):24312–24320, 2025. doi:10.1609/aaai.v39i23.34608.

  8. Liu, A. H., Khandelwal, K., Subramanian, S., Jouault, V., Rastogi, A., Sadé, A., Jeffares, A., Jiang, A., et al. …

  9. Mao, M., Yin, B., Zhu, Y., and Fang, X. Early Stopping Chain-of-Thoughts in Large Language Models, 2025. arXiv:2509.14004.

  10. Mu, J., Li, X. L., and Goodman, N. Learning to Compress Prompts with Gist Tokens. NeurIPS 2023.

  11. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple Test-Time Scaling, 2025. arXiv:2501.19393.

  12. Nagle, A., Girish, A., Bondaschi, M., Gastpar, M., Makkuva, A. V., and Kim, H. Fundamental Limits o…

  13. Shrivastava, V., Awadallah, A., Balachandran, V., Garg, S., Behl, H., and Papailiopoulos, D. Sample More to Think Less: Group Filtered Policy…

  14. Training Large Language Models to Reason in a Continuous Latent Space. https://openreview.net/forum?id=uREj4ZuGJE

  15. Zhang, J., Zhu, Y., Sun, M., Luo, Y., Qiao, S., Du, L., Zheng, D., Chen, H., and Zhang, N. LightThinker: Thinking Step-by-Step Compression. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.184/

  16. Zhang, Z., He, X., Yan, W., Shen, A., Zhao, C., Wang, S., Shen, Y., and Wang, X. E. Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space, 2025. arXiv:2505…