Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

Delai Qiu; Jiaen Liang; Jian Sun; Jitao Sang; Shengping Liu; Shuyu Wei; Wei Huang; Ying Fu; Yining Wang

arxiv: 2605.19358 · v1 · pith:65O4E7ZLnew · submitted 2026-05-19 · 💻 cs.CL

Pith reviewed 2026-05-20 06:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords Conditional Entropy Shaping · LLM reasoning · token-level entropy · adaptive exploration · mathematical benchmarks · response length · DAPO

The pith

Conditional Entropy Shaping lets LLMs shorten responses on easy problems while exploring more on hard ones by shaping token entropy conditionally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Conditional Entropy Shaping (CES) to address the trade-off in LLM reasoning between accuracy and response length. It builds on DAPO by treating token-level entropy at forking points as an uncertainty signal. The method applies a conditional policy that penalizes high-entropy tokens on correct paths to encourage conciseness and rewards them on incorrect paths to promote deeper exploration and error correction. On 12 mathematical benchmarks using DeepSeek-R1-Distill-7B, CES raises average accuracy while cutting response length compared to the baseline. Similar patterns appear with a 1.5B model and on out-of-domain tasks.

Core claim

CES applies a conditional bidirectional policy to token-level entropy at forking points: it penalizes high entropy on correct reasoning paths to produce more concise solutions and rewards high entropy on incorrect paths to encourage exploration and error correction, yielding higher accuracy and shorter responses than DAPO across mathematical benchmarks.
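
To make the conditional rule concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the shaping term is added to a per-token advantage from the base objective, that a forking point is any token whose entropy exceeds a fixed threshold `tau`, and that the strength is a scalar `beta`. The function name, `tau`, `beta`, and the additive form are all illustrative assumptions, since the abstract does not give the exact formula.

```python
from typing import List


def ces_shape_advantages(
    advantages: List[float],
    entropies: List[float],
    is_correct: bool,
    tau: float = 1.0,   # assumed entropy threshold that marks a forking point
    beta: float = 0.1,  # assumed scaling of the entropy term
) -> List[float]:
    """Hypothetical CES-style shaping of per-token advantages for one response."""
    if len(advantages) != len(entropies):
        raise ValueError("advantages and entropies must have the same length")
    # Penalize entropy on correct paths, reward it on incorrect paths.
    sign = -1.0 if is_correct else 1.0
    shaped = []
    for adv, ent in zip(advantages, entropies):
        if ent > tau:
            # Forking point: add the signed, scaled entropy term.
            shaped.append(adv + sign * beta * ent)
        else:
            # Low-entropy token: leave the base advantage untouched.
            shaped.append(adv)
    return shaped
```

With zero base advantages, the shaped value at a high-entropy token comes out negative on a correct response and positive on an incorrect one, which is the bidirectional behaviour the claim describes.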

What carries the argument

Conditional bidirectional policy on token-level entropy at forking points, which penalizes uncertainty on correct paths and rewards it on incorrect paths to adapt reasoning depth.

Load-bearing premise

Token-level entropy at forking points serves as a reliable uncertainty signal that can decide whether to penalize or reward exploration on a given reasoning path.
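
For readers who want to see what this signal is, the sketch below computes the Shannon entropy of a next-token distribution from raw logits and flags steps above a threshold. The helper names and the fixed-threshold rule are assumptions made for illustration; the summary above does not say how the paper detects forking points.

```python
import math
from typing import List


def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_points(step_logits: List[List[float]], threshold: float) -> List[int]:
    """Indices of decoding steps whose entropy exceeds an assumed threshold."""
    return [
        t for t, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]
```

A uniform distribution over four tokens has entropy log 4 (about 1.39 nats), while a sharply peaked one is close to zero, so only the former would be treated as a forking point under a threshold such as 0.5.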

What would settle it

If the conditional entropy adjustments produce lower accuracy or longer responses than DAPO on the 12 mathematical benchmarks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.19358 by Delai Qiu, Jiaen Liang, Jian Sun, Jitao Sang, Shengping Liu, Shuyu Wei, Wei Huang, Ying Fu, Yining Wang.

**Figure 1.** Overview of our CES pipeline.

**Figure 2.** Training dynamics of average response length (a), entropy (b), and accuracy (c) for the DAPO baseline

**Figure 3.** Comparison of average response length, strat

Original abstract

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CES layers a conditional bidirectional entropy policy onto DAPO to shorten correct paths and explore on wrong ones, but the abstract gives almost no numbers or controls to show the signal actually works.

The paper's core move is to take DAPO and add a rule that watches token-level entropy at forking points: penalize high entropy when the path reaches a correct answer, reward it when the path is wrong. The goal is to get shorter outputs on easy problems and more exploration on hard ones without needing an external verifier. That conditional framing is the main thing that feels new on top of earlier entropy-based reasoning papers. The reported results on DeepSeek-R1-Distill-7B across 12 math benchmarks show higher average accuracy and lower response length than the DAPO baseline, with similar patterns on a 1.5B model and some out-of-domain tasks. Those trends are the part worth paying attention to if the numbers hold up in the full text. The experiments at least try to check transfer, which is better than many quick ablations in this area.

The soft spots sit in the evidence and the central assumption. The abstract claims consistent gains but supplies no effect sizes, variance numbers, statistical tests, or ablation tables, so it is difficult to tell how large or stable the improvements are or whether the entropy term is really driving them. The key premise, that local next-token entropy at branch points reliably indicates when further exploration is useful, could easily be noisy. Entropy can spike from syntactic choices or early ambiguity that has nothing to do with final correctness, and if those cases are common the policy might shorten good paths too much or waste compute on paths that stay wrong. Without the full methods and results it is hard to judge whether the correlation is strong enough to support the adaptive behavior claimed.

This work would interest people who deploy reasoning models and care about trading off latency against accuracy on math-style tasks. A reader who already follows DAPO and entropy regularization papers will get the most out of it.

The paper deserves a serious referee so the details on the policy implementation, the exact forking-point detection, and the ablations can be checked properly. I would send it for peer review with instructions to focus on whether the entropy signal correlates with actual correctness and whether the length-accuracy trade-off survives proper controls.

Referee Report

3 major / 2 minor

Summary. The paper introduces Conditional Entropy Shaping (CES), which extends DAPO by using token-level entropy at forking points as an uncertainty signal. It applies a conditional bidirectional policy that penalizes high-entropy tokens on paths leading to correct answers (to promote conciseness) and rewards them on paths leading to incorrect answers (to encourage exploration). Experiments on DeepSeek-R1-Distill-7B across 12 mathematical benchmarks report consistent accuracy gains alongside reduced response lengths, with supplementary results on a 1.5B model and out-of-domain tasks.

Significance. If the empirical results and the underlying assumption hold, CES provides a practical mechanism for adaptive reasoning depth that avoids the uniform lengthening or accuracy trade-offs seen in prior entropy-based methods. The approach is grounded in an observable signal rather than direct fitting to accuracy, which supports its potential for broader application in controllable LLM inference.

major comments (3)

[§3] §3 (Method): The central claim that token-level entropy at forking points reliably signals the value of further exploration or the feasibility of deterministic shortening is load-bearing for the bidirectional policy. The manuscript does not provide per-path or per-forking-point analysis demonstrating that high-entropy tokens on incorrect paths correlate with recoverable errors or that penalizing them on correct paths preserves accuracy; aggregate benchmark averages alone do not establish this correlation.
[§4.1] §4.1 (Main results): The reported improvements over DAPO are described as 'consistent' but the text supplies no effect sizes, standard deviations, statistical significance tests, or ablation controls isolating the conditional entropy term. Without these, it is impossible to determine whether the accuracy gains and length reductions are attributable to CES or to other implementation details.
[§4.3] §4.3 (Ablations and smaller model): The supplementary experiments on the 1.5B backbone and out-of-domain tasks are summarized as showing 'similar trends,' yet no quantitative comparison tables or controls for the entropy threshold and reward scaling are presented. This weakens the generality claim.
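
The per-forking-point analysis requested in the first major comment could start from something as simple as the sketch below, which pools the entropies of above-threshold tokens by final-answer correctness and reports the mean of each pool. The input layout, the function name, and the threshold rule are hypothetical; nothing here is taken from the paper's code.

```python
from statistics import mean
from typing import Dict, List, Tuple


def entropy_by_correctness(
    samples: List[Tuple[List[float], bool]],
    threshold: float,
) -> Dict[str, float]:
    """Mean forking-point entropy, split by whether the response was correct.

    samples: one (per-token entropies, final answer correct?) pair per response.
    threshold: assumed cutoff above which a token counts as a forking point.
    """
    buckets: Dict[str, List[float]] = {"correct": [], "incorrect": []}
    for entropies, is_correct in samples:
        forks = [h for h in entropies if h > threshold]
        buckets["correct" if is_correct else "incorrect"].extend(forks)
    # An empty bucket has no defined mean, so report NaN for it.
    return {
        name: (mean(values) if values else float("nan"))
        for name, values in buckets.items()
    }
```

A difference between the two means would not by itself show that high-entropy tokens on incorrect paths mark recoverable errors, but it is the kind of conditional statistic that aggregate benchmark averages cannot provide.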

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit numerical deltas (e.g., average accuracy lift and token reduction percentages) rather than qualitative statements of improvement.
[§3.2] Notation for the conditional reward function (e.g., how the sign of the entropy bonus is determined from final answer correctness) should be formalized in an equation rather than described only in prose.
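
As an illustration of the kind of equation the second minor comment asks for, one possible formalization is sketched below. It is an assumption made for this review, not notation from the paper: $\hat{A}_t$ is the per-token advantage from the base objective, $H_t$ the policy entropy at token $t$, $\tau$ an entropy threshold marking forking points, $\beta > 0$ a scaling factor, and $c \in \{0, 1\}$ the correctness of the final answer.

```latex
\tilde{A}_t = \hat{A}_t + s(c)\,\beta\,H_t\,\mathbb{1}[H_t > \tau],
\qquad
s(c) =
\begin{cases}
  -1 & \text{if } c = 1 \text{ (correct path)}, \\
  +1 & \text{if } c = 0 \text{ (incorrect path)},
\end{cases}
\qquad
H_t = -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \log \pi_\theta(v \mid x, y_{<t}).
```

Writing the rule this way makes explicit that the sign depends only on $c$, and it exposes $\tau$ and $\beta$ as the two quantities a sensitivity analysis would need to vary.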

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate revisions to provide stronger empirical grounding for our claims.

Referee: [§3] §3 (Method): The central claim that token-level entropy at forking points reliably signals the value of further exploration or the feasibility of deterministic shortening is load-bearing for the bidirectional policy. The manuscript does not provide per-path or per-forking-point analysis demonstrating that high-entropy tokens on incorrect paths correlate with recoverable errors or that penalizing them on correct paths preserves accuracy; aggregate benchmark averages alone do not establish this correlation.

Authors: We agree that per-path and per-forking-point analysis would more directly support the core assumption underlying the bidirectional policy. In the revised manuscript we will add a targeted analysis (new subsection in §3 or §4) that examines representative forking points, reports entropy distributions conditioned on path correctness, and shows how the policy adjustments correlate with error recovery on incorrect paths and accuracy retention on correct paths. This will move beyond aggregate averages while remaining within the existing experimental data.
revision: yes

Referee: [§4.1] §4.1 (Main results): The reported improvements over DAPO are described as 'consistent' but the text supplies no effect sizes, standard deviations, statistical significance tests, or ablation controls isolating the conditional entropy term. Without these, it is impossible to determine whether the accuracy gains and length reductions are attributable to CES or to other implementation details.

Authors: We acknowledge the need for more rigorous statistical presentation. The revised version will report standard deviations across repeated runs, effect sizes for accuracy and length changes, and statistical significance tests (paired t-tests or equivalent) comparing CES to DAPO. We will also add an ablation that isolates the conditional entropy term by comparing the full CES policy against a variant that removes the bidirectional conditioning while keeping all other hyperparameters fixed.
revision: yes

Referee: [§4.3] §4.3 (Ablations and smaller model): The supplementary experiments on the 1.5B backbone and out-of-domain tasks are summarized as showing 'similar trends,' yet no quantitative comparison tables or controls for the entropy threshold and reward scaling are presented. This weakens the generality claim.

Authors: We will expand the supplementary material with full quantitative tables for the 1.5B model and out-of-domain tasks, reporting accuracy, length, and variance metrics. These tables will include sensitivity controls that vary the entropy threshold and reward scaling factor, demonstrating that the observed trends hold across a range of these hyperparameters.
revision: yes
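
The paired comparison promised in the second response can be sketched with the standard library alone. The function below takes per-benchmark scores for the baseline and for the treatment, matched by benchmark, and returns the mean difference, the paired t statistic, and a paired effect size (Cohen's d computed on the differences). The function name is hypothetical, and any numbers fed to it would have to come from the paper's full results, which are not available here.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_summary(
    baseline: List[float],
    treatment: List[float],
) -> Tuple[float, float, float]:
    """Return (mean difference, paired t statistic, Cohen's d on differences)."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample standard deviation of the paired differences
    if d_sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = d_mean / (d_sd / math.sqrt(len(diffs)))
    effect_size = d_mean / d_sd
    return d_mean, t_stat, effect_size
```

With 12 benchmarks the statistic has 11 degrees of freedom. A p-value can then be read from the t distribution, for example with `scipy.stats.ttest_rel`, which runs the same paired test directly.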

Circularity Check

0 steps flagged

No significant circularity: CES policy defined from observable entropy signal, not from target accuracy or length metrics.

full rationale

The paper proposes Conditional Entropy Shaping (CES) as a control mechanism that takes token-level entropy at forking points as an input uncertainty signal and applies a conditional bidirectional reward/penalty rule based on whether the path is correct or incorrect. This construction does not define the entropy signal in terms of the final accuracy or length outcomes, nor does it fit parameters to the evaluation benchmarks and then relabel those fits as predictions. The reported improvements are presented as empirical results from running the defined policy on 12 benchmarks, not as tautological consequences of the method's own equations. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the core policy. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the assumption that entropy is a usable uncertainty signal.

pith-pipeline@v0.9.0 · 5725 in / 1042 out tokens · 36959 ms · 2026-05-20T06:35:44.311958+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 15 internal anchors

[1] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
[2] Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems.
[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
[4] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
[5] Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. arXiv preprint arXiv:2412.21187.
[6] Reasoning Models Can Be Effective Without Thinking. arXiv preprint arXiv:2504.09858.
[7] PENCIL: Long Thoughts with Short Memory. arXiv preprint arXiv:2503.14337.
[8] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. arXiv preprint arXiv:2506.01939.
[9] Reasoning with Exploration: An Entropy Perspective. arXiv preprint arXiv:2506.14758.
[10] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.
[11] Entropy-Aware Branching for Improved Mathematical Reasoning. arXiv preprint arXiv:2503.21961.
[12] The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning. arXiv preprint arXiv:2505.15134.
[13] s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.
[14] Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning. arXiv preprint arXiv:2505.21178.
[15] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models. arXiv preprint arXiv:2504.09696.
[16] Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning. arXiv preprint arXiv:2506.05256.
[17] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning. arXiv preprint arXiv:2503.04697.
[18] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[19] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.
[20] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
[21] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models. arXiv preprint arXiv:2503.16419.
[22] OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework. arXiv preprint arXiv:2405.11143.
[23] DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning. arXiv preprint arXiv:2504.11456.
[24] Qwen2.5-Math.
[25] The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models. arXiv preprint arXiv:2505.22617.
[26] AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning. arXiv preprint arXiv:2505.11896.
[27] Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization. arXiv preprint arXiv:2504.21659.
