ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

Fei Yu; Han Xia; Jiaqing Liang; Jinyi Han; Shuguang Ma; Sihang Jiang; Tingyun Li; Xinyi Wang; Yanghua Xiao; Zhaoqian Dai

arxiv: 2606.19919 · v1 · pith:UJPLXVL4new · submitted 2026-06-18 · 💻 cs.LG

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

Tingyun Li , Zishang Jiang , Jinyi Han , Xinyi Wang , Sihang Jiang , Han Xia , Zhaoqian Dai , Shuguang Ma

show 3 more authors

Fei Yu Jiaqing Liang Yanghua Xiao

This is my paper

Pith reviewed 2026-06-26 18:24 UTC · model grok-4.3

classification 💻 cs.LG

keywords large reasoning modelschain-of-thoughtefficiency optimizationtoken-level decouplingmode-selection tokenPareto frontierinference controldual-process thinking

0 comments

The pith

ADaPT decouples efficiency and correctness at the token level with a mode-selection token so one model can control the efficiency-performance trade-off at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models rely on long chain-of-thought sequences for strong performance, but uniform application raises high computational costs. Prior efficiency methods often degrade capability because they apply incentives at the full-sequence level and thereby penalize correct long trajectories. ADaPT introduces a mode-selection token that governs fast versus slow reasoning and restricts efficiency-related rewards to that token alone. This separation preserves the model's ability to generate accurate long reasoning on other tokens while permitting the user to steer the efficiency-performance balance by changing the token's generation probability during inference.

Core claim

ADaPT is a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training by introducing a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. This design enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier.

What carries the argument

The mode-selection token that decides between fast and slow reasoning modes, with efficiency rewards applied exclusively to it during training.

If this is right

Inference cost drops while reasoning performance stays strong across multiple benchmarks.
A single trained model reaches any point on the efficiency-performance Pareto frontier by changing one token's probability.
Efficiency incentives no longer implicitly penalize correct long reasoning paths.
The approach applies without requiring separate models for different operating points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-level isolation could be used to control other generation attributes such as response length or explicit uncertainty.
External conditioning on the mode token might allow per-query adaptation without changing the trained weights.
If the separation holds, training pipelines could add multiple orthogonal control tokens for different objectives.

Load-bearing premise

Efficiency-related rewards can be trained exclusively on the mode-selection token without degrading the model's ability to generate correct long reasoning trajectories on the remaining tokens.

What would settle it

Measuring whether varying the mode-selection token's generation probability produces a smooth accuracy-versus-compute curve on a held-out reasoning benchmark, versus observing sharp performance drops when the probability is shifted toward efficiency.

Figures

Figures reproduced from arXiv: 2606.19919 by Fei Yu, Han Xia, Jiaqing Liang, Jinyi Han, Shuguang Ma, Sihang Jiang, Tingyun Li, Xinyi Wang, Yanghua Xiao, Zhaoqian Dai, Zishang Jiang.

**Figure 2.** Figure 2: Clipping ratio during GRPO training. Clip [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy and length trade-off of ADaPT on Easy (left) and Hard (right) tasks under different [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution and accuracy of <answer> and <think> modes on Easy and Hard tasks. ADaPT adaptively selects fast reasoning for Easy tasks and increases slow reasoning only when required on Hard tasks, achieving a superior balance between efficiency and reasoning performance. 0.00 0.25 0.50 0.75 1.00 20 30 40 50 60 70 80 90 100 Ratio of <think> mode(%) 7B 3B [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Effect of γ on the proportion of slow reasoning (<think>) usage for 7B and 3B models. Larger γ monotonically increases slow reasoning usage, with the 3B model showing a stronger shift due to lower fast reasoning reliability. 5 Related Work 5.1 Length Compression Recently, many studies have focused on improving the reasoning efficiency of LLMs. Some promptbased approaches aim to simplify reasoning by modi… view at source ↗

**Figure 6.** Figure 6: Examples of GRPO Training Data System Prompt /*System Prompt*/ Please reason step by step. For difficult questions, output <think> and engage long thinking mode. For simple questions, output <answer> and engage short thinking mode. Provide your final answer within \boxed{}. /*User Prompt*/ Let a > 0, and let P(x) be a polynomial with integer coefficients such that P(1) = P(3) = P(5) = P(7) = a and P(2) = P… view at source ↗

read the original abstract

Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ADaPT's mode-selection token decouples efficiency rewards from reasoning tokens to enable inference-time Pareto control.

read the letter

The main things here are the mode-selection token for decoupling efficiency and correctness signals, and the resulting ability to control the trade-off at inference by adjusting that token's probability.

The paper does a good job identifying the sequence-level coupling as the root cause and designing the training to apply efficiency rewards only to the special token. This avoids penalizing correct long reasoning, which is a sensible fix. The continuous control feature is also a plus for flexibility.

The soft spots are around whether the decoupling works in practice without side effects on reasoning quality. The assumption that efficiency training on just the mode-selection token won't hurt the rest of the model needs solid evidence from the experiments. Since the abstract claims strong results across benchmarks, the paper likely has data on this, but the details matter for how general it is.

This work is for researchers focused on efficient reasoning in large models. It has a clear mechanism and targets a real problem, so it deserves serious peer review to scrutinize the implementation and results.

I would recommend sending it to referees.

Referee Report

2 major / 0 minor

Summary. The paper identifies sequence-level coupling between efficiency incentives and correctness optimization as the root cause of degraded reasoning performance in existing efficiency methods for large reasoning models. It proposes Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that introduces a dedicated mode-selection token to control fast vs. slow reasoning modes. Efficiency-related rewards are applied exclusively to this token during training to avoid penalizing correct long trajectories, while at inference the generation probability of the mode-selection token can be adjusted to enable continuous control along the efficiency-performance Pareto frontier. The abstract states that extensive experiments demonstrate reduced inference cost while maintaining strong reasoning performance across benchmarks.

Significance. If the token-level decoupling mechanism works as described without side effects on reasoning quality, the approach could offer a practical method for training a single model that supports flexible, inference-time trade-off control. This would be a meaningful advance for deploying reasoning models under varying compute constraints, provided the experimental results substantiate the claims about decoupling and Pareto control.

major comments (2)

[Abstract] The central claim that restricting efficiency rewards exclusively to the mode-selection token successfully decouples signals without degrading correctness on remaining tokens (the weakest assumption noted) is load-bearing but cannot be evaluated, as no training objective, reward formulation, or ablation results are provided in the available text.
[Abstract] The assertion of 'precise and continuous control' over the efficiency-performance trade-off via adjustment of the mode-selection token probability at inference requires empirical demonstration (e.g., smooth Pareto curves across multiple values); the abstract states this but supplies no supporting figures, tables, or quantitative results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying areas where the abstract's claims require stronger support in the provided text. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] The central claim that restricting efficiency rewards exclusively to the mode-selection token successfully decouples signals without degrading correctness on remaining tokens (the weakest assumption noted) is load-bearing but cannot be evaluated, as no training objective, reward formulation, or ablation results are provided in the available text.

Authors: We agree that the abstract, as presented, does not include the training objective, reward formulation, or ablation results, making the decoupling claim difficult to evaluate from the visible text alone. The complete manuscript details the token-level reward application in Section 3 and provides ablations in Section 4.2 confirming preserved correctness on non-mode tokens. To address this directly, we will revise the abstract to include a concise reference to the token-level decoupling mechanism and move a high-level description of the objective into the introduction. revision: yes
Referee: [Abstract] The assertion of 'precise and continuous control' over the efficiency-performance trade-off via adjustment of the mode-selection token probability at inference requires empirical demonstration (e.g., smooth Pareto curves across multiple values); the abstract states this but supplies no supporting figures, tables, or quantitative results.

Authors: We acknowledge that the abstract asserts precise and continuous control without accompanying empirical evidence in the provided text. The full manuscript includes these demonstrations via Pareto curves and quantitative results across multiple probability values in Section 5 and the associated figures. We will revise the abstract to reference the experimental validation of the Pareto frontier or qualify the claim with a pointer to the results section. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces ADaPT as a new token-level dual-process framework using a dedicated mode-selection token with efficiency rewards restricted to it. The abstract and description present this as an architectural choice to address sequence-level coupling, with inference-time control via token probability adjustment. No equations, self-citations, or derivations are shown that reduce the claimed benefits (decoupling, Pareto control) to fitted inputs or prior self-referential results by construction. The central claims remain independent of the inputs they are derived from.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on the abstract; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5743 in / 1000 out tokens · 21050 ms · 2026-06-26T18:24:34.390607+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 3 linked inside Pith

[1]

ArXiv , year=

OpenAI o1 System Card , author=. ArXiv , year=
[2]

arXiv preprint arXiv:2501.12948 , year=

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv
[3]

ArXiv , year=

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models , author=. ArXiv , year=
[4]

2025 , url=

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? , author=. 2025 , url=

2025
[5]

ArXiv , year=

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning , author=. ArXiv , year=
[6]

ArXiv , year=

ARM: Adaptive Reasoning Model , author=. ArXiv , year=
[7]

2025 , url=

Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning , author=. 2025 , url=

2025
[8]

ArXiv , year=

Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models , author=. ArXiv , year=
[9]

2025 , url=

DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs , author=. 2025 , url=

2025
[10]

ArXiv , year=

Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking , author=. ArXiv , year=
[11]

ArXiv , year=

Learn to Reason Efficiently with Adaptive Length-based Reward Shaping , author=. ArXiv , year=
[12]

ArXiv , year=

When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning , author=. ArXiv , year=
[13]

ArXiv , year=

Thinkless: LLM Learns When to Think , author=. ArXiv , year=
[14]

ArXiv , year=

DynamicMind: A Tri-Mode Thinking System for Large Language Models , author=. ArXiv , year=
[15]

ArXiv , year=

AdaptThink: Reasoning Models Can Learn When to Think , author=. ArXiv , year=
[16]

ArXiv , year=

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning , author=. ArXiv , year=
[17]

ArXiv , year=

Kimi k1.5: Scaling Reinforcement Learning with LLMs , author=. ArXiv , year=
[18]

ArXiv , year=

ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models , author=. ArXiv , year=
[19]

arXiv preprint arXiv:2508.21113 , year=

R-4b: Incentivizing general-purpose auto-thinking capability in mllms via bi-mode annealing and reinforce learning , author=. arXiv preprint arXiv:2508.21113 , year=

arXiv
[20]

ArXiv , year=

Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging , author=. ArXiv , year=
[21]

ArXiv , year=

s1: Simple test-time scaling , author=. ArXiv , year=
[22]

ArXiv , year=

Chain of Draft: Thinking Faster by Writing Less , author=. ArXiv , year=
[23]

The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , year=

Renze, Matthew and Guven, Erhan , booktitle=. The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , year=
[24]

ArXiv , year=

Token-Budget-Aware LLM Reasoning , author=. ArXiv , year=
[25]

ArXiv , year=

DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models , author=. ArXiv , year=
[26]

ArXiv , year=

Demystifying Long Chain-of-Thought Reasoning in LLMs , author=. ArXiv , year=
[27]

ArXiv , year=

RouteLLM: Learning to Route LLMs with Preference Data , author=. ArXiv , year=
[28]

ArXiv , year=

Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow , author=. ArXiv , year=
[29]

ArXiv , year=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. ArXiv , year=
[30]

2025 , url=

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization , author=. 2025 , url=

2025
[31]

ArXiv , year=

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention , author=. ArXiv , year=
[32]

ArXiv , year=

Qwen2.5 Technical Report , author=. ArXiv , year=
[33]

ArXiv , year=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. ArXiv , year=
[34]

ArXiv , year=

Training Verifiers to Solve Math Word Problems , author=. ArXiv , year=
[35]

ArXiv , year=

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge , author=. ArXiv , year=
[36]

ArXiv , year=

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark , author=. ArXiv , year=
[37]

arXiv preprint arXiv:1803.05457 , year=

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

Pith/arXiv arXiv
[38]

Annual Meeting of the Association for Computational Linguistics , year=

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems , author=. Annual Meeting of the Association for Computational Linguistics , year=
[39]

ArXiv , year=

Training Language Models to Reason Efficiently , author=. ArXiv , year=
[40]

ArXiv , year=

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs , author=. ArXiv , year=
[41]

ArXiv , year=

Automatic Curriculum Expert Iteration for Reliable LLM Reasoning , author=. ArXiv , year=
[42]

ArXiv , year=

Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation , author=. ArXiv , year=
[43]

ArXiv , year=

Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning , author=. ArXiv , year=
[44]

Proceedings of the Twentieth European Conference on Computer Systems , pages=

Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=
[45]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models , author=. Trans. Mach. Learn. Res. , year=
[46]

ArXiv , year=

How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach , author=. ArXiv , year=
[47]

ArXiv , year=

ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning , author=. ArXiv , year=
[48]

Jonathan St B. T. Evans , doi =. In Two Minds: Dual-Process Accounts of Reasoning , volume =. Trends in Cognitive Sciences , number =
[49]

ArXiv , year=

A Long Way to Go: Investigating Length Correlations in RLHF , author=. ArXiv , year=
[50]

arXiv preprint arXiv:1207.4115 , year=

Dynamic programming for structured continuous Markov decision problems , author=. arXiv preprint arXiv:1207.4115 , year=

Pith/arXiv arXiv
[51]

ArXiv , year=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. ArXiv , year=
[52]

ArXiv , year=

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning , author=. ArXiv , year=
[53]

ArXiv , year=

The Llama 3 Herd of Models , author=. ArXiv , year=

[1] [1]

ArXiv , year=

OpenAI o1 System Card , author=. ArXiv , year=

[2] [2]

arXiv preprint arXiv:2501.12948 , year=

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv

[3] [3]

ArXiv , year=

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models , author=. ArXiv , year=

[4] [4]

2025 , url=

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? , author=. 2025 , url=

2025

[5] [5]

ArXiv , year=

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning , author=. ArXiv , year=

[6] [6]

ArXiv , year=

ARM: Adaptive Reasoning Model , author=. ArXiv , year=

[7] [7]

2025 , url=

Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning , author=. 2025 , url=

2025

[8] [8]

ArXiv , year=

Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models , author=. ArXiv , year=

[9] [9]

2025 , url=

DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs , author=. 2025 , url=

2025

[10] [10]

ArXiv , year=

Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking , author=. ArXiv , year=

[11] [11]

ArXiv , year=

Learn to Reason Efficiently with Adaptive Length-based Reward Shaping , author=. ArXiv , year=

[12] [12]

ArXiv , year=

When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning , author=. ArXiv , year=

[13] [13]

ArXiv , year=

Thinkless: LLM Learns When to Think , author=. ArXiv , year=

[14] [14]

ArXiv , year=

DynamicMind: A Tri-Mode Thinking System for Large Language Models , author=. ArXiv , year=

[15] [15]

ArXiv , year=

AdaptThink: Reasoning Models Can Learn When to Think , author=. ArXiv , year=

[16] [16]

ArXiv , year=

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning , author=. ArXiv , year=

[17] [17]

ArXiv , year=

Kimi k1.5: Scaling Reinforcement Learning with LLMs , author=. ArXiv , year=

[18] [18]

ArXiv , year=

ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models , author=. ArXiv , year=

[19] [19]

arXiv preprint arXiv:2508.21113 , year=

R-4b: Incentivizing general-purpose auto-thinking capability in mllms via bi-mode annealing and reinforce learning , author=. arXiv preprint arXiv:2508.21113 , year=

arXiv

[20] [20]

ArXiv , year=

Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging , author=. ArXiv , year=

[21] [21]

ArXiv , year=

s1: Simple test-time scaling , author=. ArXiv , year=

[22] [22]

ArXiv , year=

Chain of Draft: Thinking Faster by Writing Less , author=. ArXiv , year=

[23] [23]

The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , year=

Renze, Matthew and Guven, Erhan , booktitle=. The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , year=

[24] [24]

ArXiv , year=

Token-Budget-Aware LLM Reasoning , author=. ArXiv , year=

[25] [25]

ArXiv , year=

DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models , author=. ArXiv , year=

[26] [26]

ArXiv , year=

Demystifying Long Chain-of-Thought Reasoning in LLMs , author=. ArXiv , year=

[27] [27]

ArXiv , year=

RouteLLM: Learning to Route LLMs with Preference Data , author=. ArXiv , year=

[28] [28]

ArXiv , year=

Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow , author=. ArXiv , year=

[29] [29]

ArXiv , year=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. ArXiv , year=

[30] [30]

2025 , url=

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization , author=. 2025 , url=

2025

[31] [31]

ArXiv , year=

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention , author=. ArXiv , year=

[32] [32]

ArXiv , year=

Qwen2.5 Technical Report , author=. ArXiv , year=

[33] [33]

ArXiv , year=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. ArXiv , year=

[34] [34]

ArXiv , year=

Training Verifiers to Solve Math Word Problems , author=. ArXiv , year=

[35] [35]

ArXiv , year=

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge , author=. ArXiv , year=

[36] [36]

ArXiv , year=

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark , author=. ArXiv , year=

[37] [37]

arXiv preprint arXiv:1803.05457 , year=

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

Pith/arXiv arXiv

[38] [38]

Annual Meeting of the Association for Computational Linguistics , year=

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems , author=. Annual Meeting of the Association for Computational Linguistics , year=

[39] [39]

ArXiv , year=

Training Language Models to Reason Efficiently , author=. ArXiv , year=

[40] [40]

ArXiv , year=

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs , author=. ArXiv , year=

[41] [41]

ArXiv , year=

Automatic Curriculum Expert Iteration for Reliable LLM Reasoning , author=. ArXiv , year=

[42] [42]

ArXiv , year=

Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation , author=. ArXiv , year=

[43] [43]

ArXiv , year=

Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning , author=. ArXiv , year=

[44] [44]

Proceedings of the Twentieth European Conference on Computer Systems , pages=

Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=

[45] [45]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models , author=. Trans. Mach. Learn. Res. , year=

[46] [46]

ArXiv , year=

How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach , author=. ArXiv , year=

[47] [47]

ArXiv , year=

ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning , author=. ArXiv , year=

[48] [48]

Jonathan St B. T. Evans , doi =. In Two Minds: Dual-Process Accounts of Reasoning , volume =. Trends in Cognitive Sciences , number =

[49] [49]

ArXiv , year=

A Long Way to Go: Investigating Length Correlations in RLHF , author=. ArXiv , year=

[50] [50]

arXiv preprint arXiv:1207.4115 , year=

Dynamic programming for structured continuous Markov decision problems , author=. arXiv preprint arXiv:1207.4115 , year=

Pith/arXiv arXiv

[51] [51]

ArXiv , year=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. ArXiv , year=

[52] [52]

ArXiv , year=

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning , author=. ArXiv , year=

[53] [53]

ArXiv , year=

The Llama 3 Herd of Models , author=. ArXiv , year=