How does Chain of Thought decompose complex tasks?

Amrut Nadgir; Pratik Chaudhari; Vijay Balasubramanian

arxiv: 2604.08872 · v2 · pith:TKQ6X5YBnew · submitted 2026-04-10 · 💻 cs.LG · cond-mat.dis-nn· cond-mat.stat-mech

How does Chain of Thought decompose complex tasks?

Amrut Nadgir , Vijay Balasubramanian , Pratik Chaudhari This is my paper

Pith reviewed 2026-05-22 10:12 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.dis-nncond-mat.stat-mech

keywords chain of thoughtclassification errorpower lawtask decompositionlanguage modelsreasoning depthtree structureoptimal depth

0 comments

The pith

Decomposing complex tasks into trees of smaller classifications with fixed degree reduces LLM prediction error to a minimum at optimal depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language tasks are modeled as selecting one answer from many possibilities, where the error rate follows a power law that grows with the total number of choices. Breaking the selection into a tree of simpler selections, each involving the same smaller number of options, therefore cuts the overall error dramatically. This structure directly captures how chain-of-thought prompting works by inserting intermediate reasoning steps. Below a critical value for the number of options per step, adding more steps increases error, but above that threshold an optimal number of steps exists that achieves the lowest possible error. Further increases in depth cannot improve on this minimum.

Core claim

The classification error of a large language model scales as a power law in the number of classes. Consequently, any task can be decomposed into a tree of smaller classification subproblems, each with the same fixed number of classes called the degree. This tree models chain-of-thought reasoning. For degrees above a critical threshold, there exists an optimal depth that minimizes the total error, and this minimum cannot be improved by increasing the depth further.

What carries the argument

The power-law scaling of classification error with number of classes, which enables error reduction via tree-structured decomposition with fixed degree.

Load-bearing premise

The classification error scales as a power law in the number of classes.

What would settle it

Measuring the error rate for classifications with varying numbers of classes and finding that it does not follow a power law, or finding that error keeps decreasing with depth without bound.

Figures

Figures reproduced from arXiv: 2604.08872 by Amrut Nadgir, Pratik Chaudhari, Vijay Balasubramanian.

**Figure 1.** Figure 1: Left: Classification error of a vision transformer scales as a power law with the number of classes m for semantic groupings (obtained by merging or splitting the original CIFAR-100 superclasses), but is roughly constant for random groupings. The fitted power law (dashed red) 0.036 m0.31 + 0.02 corresponds to an intrinsic dimension of d = 6.45, which is close to the prior estimate of d = 15 ± 5 by Pope et … view at source ↗

**Figure 2.** Figure 2: Illustration of regions Ai in the input domain X corresponding to different classes i. With fewer classes (left), regions are well-separated (long arrow). As the number of classes increases (right), regions are packed more densely which reducing the distance between them. This reduced margin forces the learned function to change rapidly, requiring a higher Lipschitz constant. We will next understand how th… view at source ↗

**Figure 3.** Figure 3: Intrinsic dimension of the sequence (left) and mean entropy of the predicted next-token distribution (right) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Theoretical reasoning gain as a function of the degree [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Left: A synthetic task where data exhibits a tree structure, see the narrative for a description. Middle: Transformers trained, using the standard next token prediction objective, on such tasks with different degrees and depths of the tree exhibit the smallest error when the degree of the tree is the same at each layer (“structure” equal to 1). Each curve corresponds to a tree of a fixed depth n and degree… view at source ↗

**Figure 6.** Figure 6: A “thinking” tree corresponding to the one in [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Thinking, i.e., using a larger depth than necessary, is detrimental when the degree [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Top Left: Test error is minimized at intermediate values of the depth expansion factor r across varying degrees m. The task is the same as in [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: A repeat of the experiment in [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: A diagram of the synthetic logical deduction task. Left: [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: illustrates this setup. The model is prompted with the target node from the original tree (e.g., “Target: X5”). It is trained to predict a deeper path through the thinking tree to retrieve the value of this target node. It has multiple reasoning paths it can choose which will give it the same correct answer, as opposed to in the original tree in [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗

read the original abstract

Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes ("degree"). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they "think", i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper models CoT as tree decomposition of classification under power-law error, yielding a critical degree and optimal depth, but the scaling assumption carries the main weight.

read the letter

The key point is that the authors treat chain-of-thought as a recursive decomposition of a classification task into a tree of smaller classifications, each with fixed degree k. Assuming the per-step error follows a power law in k, they derive a critical k below which more depth hurts and above which an optimal finite depth exists, with the claim that no amount of extra depth can beat that minimum error. They do a good job of making the decomposition explicit and showing how it captures the thinking aspect of CoT. The model is simple, which makes the critical threshold and optimal depth easy to see as functions of the power-law exponent. This kind of quantitative account is useful because it moves beyond just saying CoT helps and tries to say when and how much. The soft spot is right where the stress-test note says: the power-law scaling of error with number of classes is asserted but not derived from the token probabilities or validated with data on different class counts. Since this form is used to prove both the benefit of decomposition and the location of the minimum, any deviation from power-law behavior would change the picture. The paper might have some empirical checks in the full text, but if they are only fits to existing results rather than independent tests, that leaves the central claim vulnerable. Minor issues like how they handle the base case or leaf errors seem secondary. This paper is for theorists and benchmark designers who want a handle on why step-by-step reasoning works in LLMs. A reader looking for new ways to analyze prompting would get value from the tree model and the threshold idea. It shows honest engagement with the problem and clear modeling choices, so it counts as serious thinking. I would recommend sending it to peer review. The topic matters and the approach is direct enough that referees can engage productively, even if revisions are needed around the scaling assumption.

Referee Report

3 major / 2 minor

Summary. The paper models language tasks as multi-class classification problems and asserts that the per-step classification error scales as a power law in the number of classes. It represents chain-of-thought as a tree decomposition with fixed branching factor (degree) k and variable depth d, derives a critical threshold on k, and shows that above this threshold an optimal finite depth exists that minimizes total error; the model further implies that this minimum cannot be improved by additional depth.

Significance. If the power-law scaling is valid, the work supplies a clean theoretical account of why and when chain-of-thought improves performance and identifies an interior optimum for thinking depth. The tree-structured error analysis is a useful formalization that could guide future empirical studies of decomposition strategies.

major comments (3)

[§3] The power-law form for single-step error E(k) is introduced as the central modeling assumption and is used both to demonstrate error reduction from decomposition and to locate the optimal depth; however, the manuscript provides neither a derivation from the LLM's token-level probabilities nor empirical measurements across varying class cardinalities. This assumption is load-bearing for the claims of a critical threshold and an unsurpassable minimum error.
[§4.2] §4.2, the derivation of total error as a function of depth d and the subsequent proof that error cannot decrease beyond the optimum both rely on the strict power-law dependence; if the true scaling is sub-power-law, saturating, or task-dependent, neither the existence of an interior minimum nor the impossibility of further improvement follows. A brief sensitivity analysis or alternative functional forms should be included.
[§4.3] The critical threshold on degree k and the optimal depth d* are obtained analytically from the power-law model; because the same functional form is used to establish both the benefit of decomposition and the location of the minimum, the argument risks circularity unless the scaling law is independently justified or validated.

minor comments (2)

[Figure 1] Figure 1 caption should explicitly state whether the plotted error curves are purely theoretical or include any empirical LLM evaluations.
[§2] The notation for total error (perhaps denoted E_total(d,k)) and the precise definition of 'degree' should be introduced with a small concrete example in the model section to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify the power-law scaling as the central modeling assumption. We respond point-by-point below, clarifying its status as an input assumption rather than a derived result, adding a sensitivity analysis, and removing any appearance of circularity. We have revised the manuscript accordingly.

read point-by-point responses

Referee: [§3] The power-law form for single-step error E(k) is introduced as the central modeling assumption and is used both to demonstrate error reduction from decomposition and to locate the optimal depth; however, the manuscript provides neither a derivation from the LLM's token-level probabilities nor empirical measurements across varying class cardinalities. This assumption is load-bearing for the claims of a critical threshold and an unsurpassable minimum error.

Authors: We agree that the power-law E(k) ∝ k^α is a modeling assumption rather than a derivation from token-level probabilities. The manuscript motivates the form by reference to known scaling phenomena in classification error versus class cardinality, but does not derive it from a specific LLM output distribution; such a derivation would require additional assumptions about the model's generative process that lie outside the paper's theoretical scope. We have expanded §3 to state this explicitly and to cite related empirical scaling laws from the literature. A full empirical sweep across class cardinalities is valuable future work but would expand the manuscript beyond its current theoretical focus; we have added a brief discussion of possible experimental tests in the conclusion. revision: partial
Referee: [§4.2] §4.2, the derivation of total error as a function of depth d and the subsequent proof that error cannot decrease beyond the optimum both rely on the strict power-law dependence; if the true scaling is sub-power-law, saturating, or task-dependent, neither the existence of an interior minimum nor the impossibility of further improvement follows. A brief sensitivity analysis or alternative functional forms should be included.

Authors: We accept this point. The revised manuscript now includes a new sensitivity subsection in §4.2 that examines three alternative forms: logarithmic scaling, linear scaling, and power-law with varying exponents. We show analytically that an interior optimum for depth d* exists whenever the per-step error grows super-linearly with k (including the original power-law regime), while saturating or sub-linear forms yield a minimum at infinite depth. The revised text states the precise conditions under which the critical threshold and finite optimum hold, thereby clarifying the scope of the claims. revision: yes
Referee: [§4.3] The critical threshold on degree k and the optimal depth d* are obtained analytically from the power-law model; because the same functional form is used to establish both the benefit of decomposition and the location of the minimum, the argument risks circularity unless the scaling law is independently justified or validated.

Authors: We disagree that the argument is circular. The power-law scaling is introduced as an exogenous assumption about single-step classification error; the subsequent derivations of the critical k threshold and the optimal finite depth are logical consequences of that assumption applied to the tree decomposition. This structure is intentional: it isolates the implications of power-law error growth for hierarchical reasoning. To remove any ambiguity we have rewritten the opening of §4.3 to emphasize that the scaling law is an input hypothesis and have added a short paragraph discussing independent empirical routes to validation (e.g., controlled synthetic classification experiments). revision: partial

Circularity Check

0 steps flagged

No significant circularity; power-law scaling serves as independent modeling assumption.

full rationale

The paper states it shows that classification error scales as a power law in the number of classes and then derives consequences for CoT tree decomposition, including a critical degree threshold and an optimal finite depth that minimizes total error. The claim that this minimum cannot be surpassed by further depth increases follows mathematically from minimizing the composite error function E(d, k) under the stated scaling. No quoted step reduces the optimal-depth result or the 'impossible to surpass' statement to a fitted parameter, a self-citation, or a redefinition of the power-law input itself. The scaling is presented as a shown result whose derivation (if present in the full text) is separate from the subsequent minimization; the central claims are therefore consequences rather than tautological restatements of the inputs. This is the normal case of a modeling paper whose conclusions depend on an explicit assumption without self-referential collapse.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The entire analysis rests on the unverified power-law scaling of classification error; no free parameters or new entities are introduced in the abstract, but the scaling itself functions as the load-bearing domain assumption.

axioms (1)

domain assumption Classification error scales as a power law in the number of classes
Invoked to obtain both the error reduction from decomposition and the existence of an optimal tree depth.

pith-pipeline@v0.9.0 · 5691 in / 1147 out tokens · 35686 ms · 2026-05-22T10:12:39.818040+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We show that the classification error in such problems scales as a power law in the number of classes... optimal degree m∗=e^{d/2}... It is impossible to surpass this minimal error by increasing the depth of thinking.
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the error is minimized when each sub-problem has a certain optimal degree m∗=e^{d/2}... thinking reduces the error up to a certain depth.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

[1]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

ISSN 1095-9203. doi: 10.1126/science.abq1158. Li, Z., Zhong, J., et al. Compressing chain-of-thought in llms via step entropy.arXiv preprint arXiv:2508.03346, 2025b. Lightman, H., Kosaraju, V., et al. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. Liu, A., Feng, B., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 20...

work page doi:10.1126/science.abq1158 2023
[2]

Pointer Sentinel Mixture Models

ISBN 978-0-521-64298-9. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.CoRR, abs/1609.07843, 2016. Muennighoff, N., Yang, Z., et al. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025. OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2024. Park, S., Zhang, Y., Yu, S. X., Beery, S., and Hu...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.7551/mitpress/11474.001.0001 2016
[3]

Target: X5

doi: 10.5281/zenodo.4414861. Woodworth, R. S. and Sells, S. B. An atmosphere effect in formal syllogistic reasoning.Journal of Experimental Psy- chology, 18(4):451–460, 1935. ISSN 0022-1015(Print). doi: 10.1037/h0060520. Wu, Y., Wang, Y., et al. When more is less: Under- standing chain-of-thought length in llms.arXiv preprint arXiv:2502.07266, 2025. Xie, ...

work page doi:10.5281/zenodo.4414861 1935
[4]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work. Only show important steps. Prepend #### to your final answer

work page
[5]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Prepend #### to your final answer

work page
[6]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Check each step to make sure it is correct. Prepend #### to your final answer

work page
[7]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Prepend #### to your final answer

work page
[8]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Then, check each step again to make sure it is correct. Prepend #### to your final answer. Each instruction in this list builds on the previous one. The first level encouraged the model to skip steps i...

work page
[9]

How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. She sold 72 clips altogether in April and May. #### 72 \n Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did s...

work page
[10]

How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. She sold 48 clips in April. She sold 48+24 = 72 clips altogether in April and May. #### 72 \n Weng earns $12 an hour for babysitting. Yesterday, she just did 50 mi...

work page
[11]

How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. She sold 48 clips in April. She sold 48+24 = 72 clips altogether in April and May. Checking, 48+24=72 is correct. #### 72 \n Weng earns...

work page
[12]

How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May. Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. She sold 48 clips in April. The total will be the sum of the clips sold in April...

work page
[13]

How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May. Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. Checking again, 48 divided by 2 is 24. She sold 48 clips in April. The total wil...

work page
[14]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work. Only show important steps. Output the final answer in a box

work page
[15]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Output the final answer in a box

work page
[16]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Check each step to make sure it is correct. Output the final answer in a box

work page
[17]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Output the final answer in a box

work page
[18]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly double-check each step to make sure it is correct. Output the final answer in a box. And the few shot examples for these two datasets are listed below in the same order

work page
[19]

\end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page
[20]

\end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page
[21]

\end{array} \ right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \ begin{array}{cl} ax+3, & \text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \ right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at...

work page
[22]

\end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page
[23]

overthink

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page 2024
[24]

This is not chain-of-thought

“This is not chain-of-thought”, “This is not how reasoning works”, “This is not how thinking is implemented in an LLM” In our theory, CoT-based reasoning is characterized by a sequence of predictions made by a model on intermediate classification sub-tasks. The set of all possible sequences of predictions (chains of thought) can be organized into a tree w...

work page
[25]

(4) as well as prior empirical and theoretical results (Bahri et al., 2024; Kaplan et al., 2020) indicate that error should scale as roughly D−1/d

What is the optimal degree and depth in real problems? How can you check the degree m of a given task?Our scaling law in Eq. (4) as well as prior empirical and theoretical results (Bahri et al., 2024; Kaplan et al., 2020) indicate that error should scale as roughly D−1/d. By evaluating the test error of a model while varying the size of the dataset D, one...

work page 2024
[26]

However, it is common to use more sophisticated methods to improve the effectiveness of CoT-based reasoning (Yao et al., 2023; Besta et al., 2024; Wang et al., 2023)

How would this analysis change if the LLM were to use tree-of-thought, graph-of-thoughts, self-consistency etc.? We have assumed each reasoning step is sampled from the predicted probability distribution over the next token. However, it is common to use more sophisticated methods to improve the effectiveness of CoT-based reasoning (Yao et al., 2023; Besta...

work page 2023

[1] [1]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

ISSN 1095-9203. doi: 10.1126/science.abq1158. Li, Z., Zhong, J., et al. Compressing chain-of-thought in llms via step entropy.arXiv preprint arXiv:2508.03346, 2025b. Lightman, H., Kosaraju, V., et al. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. Liu, A., Feng, B., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 20...

work page doi:10.1126/science.abq1158 2023

[2] [2]

Pointer Sentinel Mixture Models

ISBN 978-0-521-64298-9. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.CoRR, abs/1609.07843, 2016. Muennighoff, N., Yang, Z., et al. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025. OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2024. Park, S., Zhang, Y., Yu, S. X., Beery, S., and Hu...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.7551/mitpress/11474.001.0001 2016

[3] [3]

Target: X5

doi: 10.5281/zenodo.4414861. Woodworth, R. S. and Sells, S. B. An atmosphere effect in formal syllogistic reasoning.Journal of Experimental Psy- chology, 18(4):451–460, 1935. ISSN 0022-1015(Print). doi: 10.1037/h0060520. Wu, Y., Wang, Y., et al. When more is less: Under- standing chain-of-thought length in llms.arXiv preprint arXiv:2502.07266, 2025. Xie, ...

work page doi:10.5281/zenodo.4414861 1935

[4] [4]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work. Only show important steps. Prepend #### to your final answer

work page

[5] [5]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Prepend #### to your final answer

work page

[6] [6]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Check each step to make sure it is correct. Prepend #### to your final answer

work page

[7] [7]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Prepend #### to your final answer

work page

[8] [8]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Then, check each step again to make sure it is correct. Prepend #### to your final answer. Each instruction in this list builds on the previous one. The first level encouraged the model to skip steps i...

work page

[9] [9]

How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. She sold 72 clips altogether in April and May. #### 72 \n Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did s...

work page

[10] [10]

How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. She sold 48 clips in April. She sold 48+24 = 72 clips altogether in April and May. #### 72 \n Weng earns $12 an hour for babysitting. Yesterday, she just did 50 mi...

work page

[11] [11]

How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. She sold 48 clips in April. She sold 48+24 = 72 clips altogether in April and May. Checking, 48+24=72 is correct. #### 72 \n Weng earns...

work page

[12] [12]

How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May. Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. She sold 48 clips in April. The total will be the sum of the clips sold in April...

work page

[13] [13]

How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May. Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. Checking again, 48 divided by 2 is 24. She sold 48 clips in April. The total wil...

work page

[14] [14]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work. Only show important steps. Output the final answer in a box

work page

[15] [15]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Output the final answer in a box

work page

[16] [16]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Check each step to make sure it is correct. Output the final answer in a box

work page

[17] [17]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Output the final answer in a box

work page

[18] [18]

Solve the math problem

You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly double-check each step to make sure it is correct. Output the final answer in a box. And the few shot examples for these two datasets are listed below in the same order

work page

[19] [19]

\end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page

[20] [20]

\end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page

[21] [21]

\end{array} \ right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \ begin{array}{cl} ax+3, & \text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \ right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at...

work page

[22] [22]

\end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page

[23] [23]

overthink

Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

work page 2024

[24] [24]

This is not chain-of-thought

“This is not chain-of-thought”, “This is not how reasoning works”, “This is not how thinking is implemented in an LLM” In our theory, CoT-based reasoning is characterized by a sequence of predictions made by a model on intermediate classification sub-tasks. The set of all possible sequences of predictions (chains of thought) can be organized into a tree w...

work page

[25] [25]

(4) as well as prior empirical and theoretical results (Bahri et al., 2024; Kaplan et al., 2020) indicate that error should scale as roughly D−1/d

What is the optimal degree and depth in real problems? How can you check the degree m of a given task?Our scaling law in Eq. (4) as well as prior empirical and theoretical results (Bahri et al., 2024; Kaplan et al., 2020) indicate that error should scale as roughly D−1/d. By evaluating the test error of a model while varying the size of the dataset D, one...

work page 2024

[26] [26]

However, it is common to use more sophisticated methods to improve the effectiveness of CoT-based reasoning (Yao et al., 2023; Besta et al., 2024; Wang et al., 2023)

How would this analysis change if the LLM were to use tree-of-thought, graph-of-thoughts, self-consistency etc.? We have assumed each reasoning step is sampled from the predicted probability distribution over the next token. However, it is common to use more sophisticated methods to improve the effectiveness of CoT-based reasoning (Yao et al., 2023; Besta...

work page 2023