When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

Haoqing Wang; Wei Xia; Yehui Tang; Zhi-Hong Deng

arxiv: 2605.22873 · v1 · pith:PV5N7DNTnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI· cs.CL

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

Wei Xia , Haoqing Wang , Zhi-Hong Deng , Yehui Tang This is my paper

Pith reviewed 2026-05-25 06:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords entropy dynamicschain-of-thoughtLLM reasoningphase transitionadaptive routingtoken efficiencydecoding controlmanifold embedding

0 comments

The pith

Early entropy reduction signals when chain-of-thought reasoning improves LLM output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the benefit of explicit reasoning steps in LLMs is not a fixed property of the task or model but appears as a dynamic state during token generation. Tasks that gain from chain-of-thought show steady entropy decline in the first few steps, while tasks that do not show flat or rising entropy. This pattern is read as a shift from a high-entropy exploratory regime to a low-entropy structured regime. The authors introduce a lightweight router that reads these early signals to decide whether to invoke reasoning or answer directly, producing large token savings and modest accuracy gains across many benchmarks.

Core claim

LLM reasoning is a dynamic decoding state that emerges during generation rather than a static task property; it is marked by consistent early-stage entropy reduction interpretable as a phase-transition-like move from high-entropy exploration to low-entropy structured output, and this signal can be embedded in a compact manifold to route adaptively between direct and chain-of-thought strategies without training.

What carries the argument

Entropy Dynamics-based Reasoning Manifold (EDRM), which embeds early decoding entropy trajectories into a compact manifold representation to route between inference strategies.

If this is right

Dataset-level routing yields 41-55 percent token reduction while improving accuracy.
Instance-level routing adds up to 4.7 percent accuracy while retaining 27-45 percent token savings.
Only about 50 calibration samples suffice for effective zero-shot routing.
The same entropy-based selection works across 15 benchmarks and four LLMs of different scales.
Reasoning steps should be invoked only on instances whose entropy trajectory shows the expected reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the entropy signal holds for agentic or tool-use loops, similar early routing could reduce wasted steps in multi-turn settings.
The manifold view might let researchers map which internal representations align with productive versus unproductive reasoning paths.
Extending the same early-trajectory test to non-CoT structured methods such as self-consistency or tree search would check whether the phase-transition idea generalizes.

Load-bearing premise

Early entropy trajectories reliably indicate whether adding explicit reasoning steps will raise final accuracy on that instance.

What would settle it

A test set of instances whose early entropy trajectories match the reported patterns yet produce accuracy changes opposite to the claimed direction would falsify the routing rule.

Figures

Figures reproduced from arXiv: 2605.22873 by Haoqing Wang, Wei Xia, Yehui Tang, Zhi-Hong Deng.

**Figure 2.** Figure 2: Unified Gain Heatmap. Each cell shows the unified gain of CoT-Direct for instances in a specific region of the (Vsp/avnr, SH) space. 3 Methodology In this section, we first present preliminary concepts and then our observations and insights about LLM decoding dynamics and their relationship to reasoning utility in our exploring investigation. Finally, we introduce EDRM, a novel framework that leverages ear… view at source ↗

**Figure 3.** Figure 3: Entropy Trajectories: average token-level entropy over the first [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Comprehensive comparison of CoT and Direct decoding strategies across all evaluated benchmarks and models. Each subplot corresponds to a specific model, with bars representing the accuracy of CoT and Direct on each benchmark. The performance differences are analyzed to identify patterns of CoT gains or losses across different tasks and model sizes. models struggle (e.g., gpqa). This indicates a strong gene… view at source ↗

**Figure 5.** Figure 5: Entropy trajectories for all evaluated models and benchmarks. Each line repre [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Unified Gain Heatmap with various λ (0.03,0.05,0.07,0.10) on base models. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Unified Gain Heatmap with various λ (0.03,0.05,0.07,0.10) on think models. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

read the original abstract

Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a \emph{dynamic decoding state} that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose \textbf{EDRM} (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves \textbf{41--55\%} token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to \textbf{4.7\%} while maintaining \textbf{27--45\%} token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable entropy-based router for skipping CoT on some tasks and reports solid token savings, but the phase-transition claim is just a label on observed correlations.

read the letter

This paper's main point is that early entropy trajectories during LLM decoding can flag when chain-of-thought will help, and the authors turn the observation into EDRM, a training-free router that embeds those trajectories in a low-dimensional manifold and decides per instance or per dataset whether to reason explicitly. They report 41-55% token reduction at dataset level and up to 4.7% accuracy gain at instance level across 15 benchmarks and four models, using only 50 calibration samples. That coverage and the practical numbers are the clearest contribution. The experiments appear to test multiple scales and architectures, which gives the results some weight. The zero-shot and fine-grained adaptation modes are also straightforward to implement if the manifold construction holds up. The soft spot is the missing causal link. The patterns are consistent—entropy drops when CoT helps and stays high or rises otherwise—but nothing isolates entropy dynamics from task difficulty, output length, or model-specific token statistics. The phase-transition language is descriptive rather than derived; there are no interventions or dynamical equations shown to support treating the shift as a controllable regime change. With only 50 samples the manifold could easily latch onto dataset proxies instead of a general signal. Readers focused on efficient inference and adaptive decoding will find the routing results worth examining. The empirical scope is broad enough that a serious referee should see it, though the authors will need to add controls and clearer method details before the framing can be taken at face value.

Referee Report

3 major / 3 minor

Summary. The paper claims that LLM reasoning emerges as a dynamical decoding state detectable via early-stage entropy trajectories, which exhibit consistent reduction on tasks benefiting from CoT (interpreted as a high-to-low entropy phase transition) but unstable or increasing patterns otherwise. It introduces EDRM, a training-free routing method that embeds these trajectories into a compact manifold for adaptive inference strategy selection, reporting 41-55% token reduction at the dataset level and up to 4.7% accuracy gains at the instance level across 15 benchmarks and 4 LLMs, using as few as 50 calibration samples.

Significance. If the entropy signal is shown to be causal rather than correlational and generalizes beyond the evaluated setups, the work could enable more efficient LLM inference by selectively invoking reasoning only when beneficial, with the training-free and manifold-based design offering practical advantages for deployment. The observational patterns across multiple models and benchmarks provide a useful starting point for entropy-driven control.

major comments (3)

[§3] The central claim that early entropy trajectories 'reliably indicate' whether explicit reasoning improves performance (abstract and §3) rests on observational correlations across benchmarks without a causal test, such as an intervention that perturbs entropy while holding task type fixed; this leaves open whether the signal drives benefit or proxies task difficulty/output length.
[§4] The phase-transition interpretation in §4 lacks a formal criterion (e.g., critical exponent, order parameter, or bifurcation condition) to distinguish a true transition from gradual entropy decay; without this, the dynamical-systems framing risks being descriptive rather than predictive.
[Table 3, §5.2] EDRM's zero-shot routing performance (Table 3 and §5.2) is reported with 41-55% token savings, but the manifold embedding trained on 50 samples shows no ablation on sample selection strategy or sensitivity analysis, raising the possibility that gains derive from task clustering rather than entropy dynamics per se.

minor comments (3)

[Eq. (3)] Notation for entropy trajectory embedding (Eq. 3) uses an undefined manifold dimension parameter; clarify its selection procedure.
[Figure 2] Figure 2 caption does not specify the number of runs or variance for the plotted entropy curves.
[Abstract] The abstract states 'up to 4.7%' accuracy improvement without noting whether this is relative or absolute and on which specific benchmark.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We provide point-by-point responses to the major comments below.

read point-by-point responses

Referee: [§3] The central claim that early entropy trajectories 'reliably indicate' whether explicit reasoning improves performance (abstract and §3) rests on observational correlations across benchmarks without a causal test, such as an intervention that perturbs entropy while holding task type fixed; this leaves open whether the signal drives benefit or proxies task difficulty/output length.

Authors: The manuscript presents extensive observational evidence from 15 benchmarks and 4 LLMs demonstrating that early entropy trajectories correlate strongly with the benefit of explicit reasoning. We agree that this does not constitute a causal demonstration. In the revised manuscript, we will update §3 to describe the signal as a reliable correlational indicator and add a paragraph discussing the potential for the entropy pattern to reflect task difficulty or output characteristics. We will also propose interventional experiments as future work. revision: yes
Referee: [§4] The phase-transition interpretation in §4 lacks a formal criterion (e.g., critical exponent, order parameter, or bifurcation condition) to distinguish a true transition from gradual entropy decay; without this, the dynamical-systems framing risks being descriptive rather than predictive.

Authors: We view the phase-transition terminology as a useful analogy for the observed shift in entropy dynamics rather than a literal application of dynamical systems theory. We will revise §4 to clarify this and provide an operational definition of the transition as the point where entropy begins a consistent decline within the first 10-20 tokens, based on the empirical data. This will make the framing more precise and highlight its predictive utility for the EDRM router. revision: yes
Referee: [Table 3, §5.2] EDRM's zero-shot routing performance (Table 3 and §5.2) is reported with 41-55% token savings, but the manifold embedding trained on 50 samples shows no ablation on sample selection strategy or sensitivity analysis, raising the possibility that gains derive from task clustering rather than entropy dynamics per se.

Authors: The manifold is constructed solely from entropy trajectory embeddings without using task labels, which helps mitigate the concern of task clustering. However, we acknowledge the absence of sensitivity analysis on the calibration sample selection. We will add to the revised version an analysis of performance with varying numbers of calibration samples and different selection methods to confirm the robustness of the entropy-based manifold. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical observations and lightweight routing method are independent of inputs

full rationale

The paper reports observational patterns in early entropy trajectories across 15 benchmarks and 4 LLMs, then introduces EDRM as a new embedding-based routing framework using those patterns. No equations, self-citations, or derivations are present that reduce any claimed prediction or phase-transition result to fitted parameters or prior author work by construction. The 50-sample calibration is described as minimal adaptation for manifold embedding rather than a self-referential fit that forces accuracy or token-reduction gains. Results are externally validated on held-out data, satisfying the criteria for a self-contained, non-circular analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; full manuscript required for ledger construction.

pith-pipeline@v0.9.0 · 5851 in / 1226 out tokens · 56570 ms · 2026-05-25T06:11:41.692282+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 9 internal anchors

[1]

EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

EvolveRouter: Co-evolving routing and prompt for multi-agent question answering , author=. arXiv preprint arXiv:2604.05149 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

arXiv preprint arXiv:2409.12183 , year=

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning , author=. arXiv preprint arXiv:2409.12183 , year=

work page arXiv
[3]

ArXiv , year=

Training Verifiers to Solve Math Word Problems , author=. ArXiv , year=

work page
[4]

ArXiv , year=

Solving General Arithmetic Word Problems , author=. ArXiv , year=

work page
[5]

ArXiv , year=

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models , author=. ArXiv , year=

work page
[6]

AAAI Conference on Artificial Intelligence , year=

ConceptNet 5.5: An Open Multilingual Graph of General Knowledge , author=. AAAI Conference on Artificial Intelligence , year=

work page
[7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Folio: Natural language reasoning with first-order logic , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[10]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Disentangling logic: The role of context in large language model reasoning capabilities , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025
[11]

ArXiv , year=

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models , author=. ArXiv , year=

work page
[12]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Transactions of the Association for Computational Linguistics , year=

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies , author=. Transactions of the Association for Computational Linguistics , year=

work page
[16]

International Conference on Learning Representations , volume=

Musr: Testing the limits of chain-of-thought with multistep soft reasoning , author=. International Conference on Learning Representations , volume=

work page
[17]

Social IQa: Commonsense reasoning about social interactions , author=. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) , pages=

work page 2019
[18]

AAAI Conference on Artificial Intelligence , year=

PIQA: Reasoning about Physical Commonsense in Natural Language , author=. AAAI Conference on Artificial Intelligence , year=

work page
[19]

ArXiv , year=

Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models , author=. ArXiv , year=

work page
[20]

Nature , year=

Detecting hallucinations in large language models using semantic entropy , author=. Nature , year=

work page
[21]

International Conference on Learning Representations , year=

Chain of thought empowers transformers to solve inherently serial problems , author=. International Conference on Learning Representations , year=

work page
[22]

arXiv preprint arXiv:2510.03366 , year=

Disentangling recall and reasoning in transformer models through layer-wise attention and activation analysis , author=. arXiv preprint arXiv:2510.03366 , year=

work page arXiv
[23]

Advances in Neural Information Processing Systems , volume=

Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=

work page
[24]

arXiv preprint arXiv:2410.21333 , year=

Mind your step (by Step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse , author=. arXiv preprint arXiv:2410.21333 , year=

work page arXiv
[25]

arXiv preprint arXiv:2512.23765 , year=

Entropy-aware speculative decoding toward improved llm reasoning , author=. arXiv preprint arXiv:2512.23765 , year=

work page arXiv
[26]

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Meta-reasoner: Dynamic guidance for optimized inference-time reasoning in large language models , author=. arXiv preprint arXiv:2502.19918 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought reasoning without prompting , author=. Advances in Neural Information Processing Systems , volume=

work page
[29]

arXiv preprint arXiv:2506.05901 , year=

Route-and-reason: Scaling large language model reasoning with reinforced model router , author=. arXiv preprint arXiv:2506.05901 , year=

work page arXiv
[30]

arXiv preprint arXiv:2509.25426 , year=

RADAR: Reasoning-ability and difficulty-aware routing for reasoning llms , author=. arXiv preprint arXiv:2509.25426 , year=

work page arXiv
[31]

arXiv preprint arXiv:2506.22716 , year=

Best-route: Adaptive llm routing with test-time optimal compute , author=. arXiv preprint arXiv:2506.22716 , year=

work page arXiv
[32]

arXiv preprint arXiv:2510.08146 , year=

Think just enough: Sequence-level entropy as a confidence signal for llm reasoning , author=. arXiv preprint arXiv:2510.08146 , year=

work page arXiv
[33]

arXiv preprint arXiv:2505.14631 , year=

Think only when you need with large hybrid-reasoning models , author=. arXiv preprint arXiv:2505.14631 , year=

work page arXiv
[34]

arXiv preprint arXiv:2509.21164 , year=

Mixture of thoughts: Learning to aggregate what experts think, not just what they say , author=. arXiv preprint arXiv:2509.21164 , year=

work page arXiv
[35]

Association for Computational Linguistics , year=

Reasoning circuits in language models: A mechanistic interpretation of syllogistic inference , author=. Association for Computational Linguistics , year=

work page
[36]

arXiv preprint arXiv:2602.13517 , year=

Think deep, not just long: measuring llm reasoning effort via deep-thinking tokens , author=. arXiv preprint arXiv:2602.13517 , year=

work page arXiv
[37]

arXiv preprint arXiv:2603.18940 , year=

Entropy trajectory shape predicts llm reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought , author=. arXiv preprint arXiv:2603.18940 , year=

work page arXiv
[38]

Advances in Neural Information Processing Systems , volume=

Towards automated circuit discovery for mechanistic interpretability , author=. Advances in Neural Information Processing Systems , volume=

work page
[39]

Well, keep thinking

“Well, keep thinking”: Enhancing llm reasoning with adaptive injection decoding , author=. Association for Computational Linguistics , pages=

work page
[40]

arXiv preprint arXiv:2602.01288 , year=

EDIS: Diagnosing llm reasoning via entropy dynamics , author=. arXiv preprint arXiv:2602.01288 , year=

work page arXiv
[41]

Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

Task-aware llm routing with multi-level task-profile-guided data synthesis for cold-start scenarios , author=. arXiv preprint arXiv:2604.09377 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

arXiv preprint arXiv:2501.15602 , year=

Rethinking external slow-thinking: From snowball errors to probability of correct reasoning , author=. arXiv preprint arXiv:2501.15602 , year=

work page arXiv
[43]

Proceedings of the ACM on Web Conference 2025 , pages=

Division-of-thoughts: Harnessing hybrid language model synergy for efficient on-device agents , author=. Proceedings of the ACM on Web Conference 2025 , pages=. 2025 , url=

work page 2025
[44]

arXiv preprint arXiv:2509.24377 , year=

Plan before solving: Problem-aware strategy routing for mathematical reasoning with llms , author=. arXiv preprint arXiv:2509.24377 , year=

work page arXiv
[45]

arXiv preprint arXiv:2604.00018 , year=

Think twice before you write--an entropy-based decoding strategy to enhance llm reasoning , author=. arXiv preprint arXiv:2604.00018 , year=

work page arXiv

[1] [1]

EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

EvolveRouter: Co-evolving routing and prompt for multi-agent question answering , author=. arXiv preprint arXiv:2604.05149 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:2409.12183 , year=

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning , author=. arXiv preprint arXiv:2409.12183 , year=

work page arXiv

[3] [3]

ArXiv , year=

Training Verifiers to Solve Math Word Problems , author=. ArXiv , year=

work page

[4] [4]

ArXiv , year=

Solving General Arithmetic Word Problems , author=. ArXiv , year=

work page

[5] [5]

ArXiv , year=

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models , author=. ArXiv , year=

work page

[6] [6]

AAAI Conference on Artificial Intelligence , year=

ConceptNet 5.5: An Open Multilingual Graph of General Knowledge , author=. AAAI Conference on Artificial Intelligence , year=

work page

[7] [7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Folio: Natural language reasoning with first-order logic , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024

[10] [10]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Disentangling logic: The role of context in large language model reasoning capabilities , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

work page 2025

[11] [11]

ArXiv , year=

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models , author=. ArXiv , year=

work page

[12] [12]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Transactions of the Association for Computational Linguistics , year=

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies , author=. Transactions of the Association for Computational Linguistics , year=

work page

[16] [16]

International Conference on Learning Representations , volume=

Musr: Testing the limits of chain-of-thought with multistep soft reasoning , author=. International Conference on Learning Representations , volume=

work page

[17] [17]

Social IQa: Commonsense reasoning about social interactions , author=. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) , pages=

work page 2019

[18] [18]

AAAI Conference on Artificial Intelligence , year=

PIQA: Reasoning about Physical Commonsense in Natural Language , author=. AAAI Conference on Artificial Intelligence , year=

work page

[19] [19]

ArXiv , year=

Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models , author=. ArXiv , year=

work page

[20] [20]

Nature , year=

Detecting hallucinations in large language models using semantic entropy , author=. Nature , year=

work page

[21] [21]

International Conference on Learning Representations , year=

Chain of thought empowers transformers to solve inherently serial problems , author=. International Conference on Learning Representations , year=

work page

[22] [22]

arXiv preprint arXiv:2510.03366 , year=

Disentangling recall and reasoning in transformer models through layer-wise attention and activation analysis , author=. arXiv preprint arXiv:2510.03366 , year=

work page arXiv

[23] [23]

Advances in Neural Information Processing Systems , volume=

Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=

work page

[24] [24]

arXiv preprint arXiv:2410.21333 , year=

Mind your step (by Step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse , author=. arXiv preprint arXiv:2410.21333 , year=

work page arXiv

[25] [25]

arXiv preprint arXiv:2512.23765 , year=

Entropy-aware speculative decoding toward improved llm reasoning , author=. arXiv preprint arXiv:2512.23765 , year=

work page arXiv

[26] [26]

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Meta-reasoner: Dynamic guidance for optimized inference-time reasoning in large language models , author=. arXiv preprint arXiv:2502.19918 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought reasoning without prompting , author=. Advances in Neural Information Processing Systems , volume=

work page

[29] [29]

arXiv preprint arXiv:2506.05901 , year=

Route-and-reason: Scaling large language model reasoning with reinforced model router , author=. arXiv preprint arXiv:2506.05901 , year=

work page arXiv

[30] [30]

arXiv preprint arXiv:2509.25426 , year=

RADAR: Reasoning-ability and difficulty-aware routing for reasoning llms , author=. arXiv preprint arXiv:2509.25426 , year=

work page arXiv

[31] [31]

arXiv preprint arXiv:2506.22716 , year=

Best-route: Adaptive llm routing with test-time optimal compute , author=. arXiv preprint arXiv:2506.22716 , year=

work page arXiv

[32] [32]

arXiv preprint arXiv:2510.08146 , year=

Think just enough: Sequence-level entropy as a confidence signal for llm reasoning , author=. arXiv preprint arXiv:2510.08146 , year=

work page arXiv

[33] [33]

arXiv preprint arXiv:2505.14631 , year=

Think only when you need with large hybrid-reasoning models , author=. arXiv preprint arXiv:2505.14631 , year=

work page arXiv

[34] [34]

arXiv preprint arXiv:2509.21164 , year=

Mixture of thoughts: Learning to aggregate what experts think, not just what they say , author=. arXiv preprint arXiv:2509.21164 , year=

work page arXiv

[35] [35]

Association for Computational Linguistics , year=

Reasoning circuits in language models: A mechanistic interpretation of syllogistic inference , author=. Association for Computational Linguistics , year=

work page

[36] [36]

arXiv preprint arXiv:2602.13517 , year=

Think deep, not just long: measuring llm reasoning effort via deep-thinking tokens , author=. arXiv preprint arXiv:2602.13517 , year=

work page arXiv

[37] [37]

arXiv preprint arXiv:2603.18940 , year=

Entropy trajectory shape predicts llm reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought , author=. arXiv preprint arXiv:2603.18940 , year=

work page arXiv

[38] [38]

Advances in Neural Information Processing Systems , volume=

Towards automated circuit discovery for mechanistic interpretability , author=. Advances in Neural Information Processing Systems , volume=

work page

[39] [39]

Well, keep thinking

“Well, keep thinking”: Enhancing llm reasoning with adaptive injection decoding , author=. Association for Computational Linguistics , pages=

work page

[40] [40]

arXiv preprint arXiv:2602.01288 , year=

EDIS: Diagnosing llm reasoning via entropy dynamics , author=. arXiv preprint arXiv:2602.01288 , year=

work page arXiv

[41] [41]

Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios

Task-aware llm routing with multi-level task-profile-guided data synthesis for cold-start scenarios , author=. arXiv preprint arXiv:2604.09377 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

arXiv preprint arXiv:2501.15602 , year=

Rethinking external slow-thinking: From snowball errors to probability of correct reasoning , author=. arXiv preprint arXiv:2501.15602 , year=

work page arXiv

[43] [43]

Proceedings of the ACM on Web Conference 2025 , pages=

Division-of-thoughts: Harnessing hybrid language model synergy for efficient on-device agents , author=. Proceedings of the ACM on Web Conference 2025 , pages=. 2025 , url=

work page 2025

[44] [44]

arXiv preprint arXiv:2509.24377 , year=

Plan before solving: Problem-aware strategy routing for mathematical reasoning with llms , author=. arXiv preprint arXiv:2509.24377 , year=

work page arXiv

[45] [45]

arXiv preprint arXiv:2604.00018 , year=

Think twice before you write--an entropy-based decoding strategy to enhance llm reasoning , author=. arXiv preprint arXiv:2604.00018 , year=

work page arXiv