pith. sign in

arxiv: 2604.08872 · v2 · pith:TKQ6X5YBnew · submitted 2026-04-10 · 💻 cs.LG · cond-mat.dis-nn· cond-mat.stat-mech

How does Chain of Thought decompose complex tasks?

Pith reviewed 2026-05-22 10:12 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.dis-nncond-mat.stat-mech
keywords chain of thoughtclassification errorpower lawtask decompositionlanguage modelsreasoning depthtree structureoptimal depth
0
0 comments X

The pith

Decomposing complex tasks into trees of smaller classifications with fixed degree reduces LLM prediction error to a minimum at optimal depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language tasks are modeled as selecting one answer from many possibilities, where the error rate follows a power law that grows with the total number of choices. Breaking the selection into a tree of simpler selections, each involving the same smaller number of options, therefore cuts the overall error dramatically. This structure directly captures how chain-of-thought prompting works by inserting intermediate reasoning steps. Below a critical value for the number of options per step, adding more steps increases error, but above that threshold an optimal number of steps exists that achieves the lowest possible error. Further increases in depth cannot improve on this minimum.

Core claim

The classification error of a large language model scales as a power law in the number of classes. Consequently, any task can be decomposed into a tree of smaller classification subproblems, each with the same fixed number of classes called the degree. This tree models chain-of-thought reasoning. For degrees above a critical threshold, there exists an optimal depth that minimizes the total error, and this minimum cannot be improved by increasing the depth further.

What carries the argument

The power-law scaling of classification error with number of classes, which enables error reduction via tree-structured decomposition with fixed degree.

Load-bearing premise

The classification error scales as a power law in the number of classes.

What would settle it

Measuring the error rate for classifications with varying numbers of classes and finding that it does not follow a power law, or finding that error keeps decreasing with depth without bound.

Figures

Figures reproduced from arXiv: 2604.08872 by Amrut Nadgir, Pratik Chaudhari, Vijay Balasubramanian.

Figure 1
Figure 1. Figure 1: Left: Classification error of a vision transformer scales as a power law with the number of classes m for semantic groupings (obtained by merging or splitting the original CIFAR-100 superclasses), but is roughly constant for random groupings. The fitted power law (dashed red) 0.036 m0.31 + 0.02 corresponds to an intrinsic dimension of d = 6.45, which is close to the prior estimate of d = 15 ± 5 by Pope et … view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of regions Ai in the input domain X corresponding to different classes i. With fewer classes (left), regions are well-separated (long arrow). As the number of classes increases (right), regions are packed more densely which reducing the distance between them. This reduced margin forces the learned function to change rapidly, requiring a higher Lipschitz constant. We will next understand how th… view at source ↗
Figure 3
Figure 3. Figure 3: Intrinsic dimension of the sequence (left) and mean entropy of the predicted next-token distribution (right) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Theoretical reasoning gain as a function of the degree [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Left: A synthetic task where data exhibits a tree structure, see the narrative for a description. Middle: Transformers trained, using the standard next token prediction objective, on such tasks with different degrees and depths of the tree exhibit the smallest error when the degree of the tree is the same at each layer (“structure” equal to 1). Each curve corresponds to a tree of a fixed depth n and degree… view at source ↗
Figure 6
Figure 6. Figure 6: A “thinking” tree corresponding to the one in [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Thinking, i.e., using a larger depth than necessary, is detrimental when the degree [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Top Left: Test error is minimized at intermediate values of the depth expansion factor r across varying degrees m. The task is the same as in [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: A repeat of the experiment in [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: A diagram of the synthetic logical deduction task. Left: [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: illustrates this setup. The model is prompted with the target node from the original tree (e.g., “Target: X5”). It is trained to predict a deeper path through the thinking tree to retrieve the value of this target node. It has multiple reasoning paths it can choose which will give it the same correct answer, as opposed to in the original tree in [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
read the original abstract

Many language tasks can be modeled as classification problems where a large language model (LLM) is given a prompt and selects one among many possible answers. We show that the classification error in such problems scales as a power law in the number of classes. This has a dramatic consequence: the prediction error can be reduced substantially by splitting the overall task into a sequence of smaller classification problems, each with the same number of classes ("degree"). This tree-structured decomposition models chain-of-thought (CoT). It has been observed that CoT-based predictors perform better when they "think", i.e., when they develop a deeper tree, thus decomposing the problem into a larger number of steps. We identify a critical threshold for the degree, below which thinking is detrimental, and above which there exists an optimal depth that minimizes the error. It is impossible to surpass this minimal error by increasing the depth of thinking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper models language tasks as multi-class classification problems and asserts that the per-step classification error scales as a power law in the number of classes. It represents chain-of-thought as a tree decomposition with fixed branching factor (degree) k and variable depth d, derives a critical threshold on k, and shows that above this threshold an optimal finite depth exists that minimizes total error; the model further implies that this minimum cannot be improved by additional depth.

Significance. If the power-law scaling is valid, the work supplies a clean theoretical account of why and when chain-of-thought improves performance and identifies an interior optimum for thinking depth. The tree-structured error analysis is a useful formalization that could guide future empirical studies of decomposition strategies.

major comments (3)
  1. [§3] The power-law form for single-step error E(k) is introduced as the central modeling assumption and is used both to demonstrate error reduction from decomposition and to locate the optimal depth; however, the manuscript provides neither a derivation from the LLM's token-level probabilities nor empirical measurements across varying class cardinalities. This assumption is load-bearing for the claims of a critical threshold and an unsurpassable minimum error.
  2. [§4.2] §4.2, the derivation of total error as a function of depth d and the subsequent proof that error cannot decrease beyond the optimum both rely on the strict power-law dependence; if the true scaling is sub-power-law, saturating, or task-dependent, neither the existence of an interior minimum nor the impossibility of further improvement follows. A brief sensitivity analysis or alternative functional forms should be included.
  3. [§4.3] The critical threshold on degree k and the optimal depth d* are obtained analytically from the power-law model; because the same functional form is used to establish both the benefit of decomposition and the location of the minimum, the argument risks circularity unless the scaling law is independently justified or validated.
minor comments (2)
  1. [Figure 1] Figure 1 caption should explicitly state whether the plotted error curves are purely theoretical or include any empirical LLM evaluations.
  2. [§2] The notation for total error (perhaps denoted E_total(d,k)) and the precise definition of 'degree' should be introduced with a small concrete example in the model section to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify the power-law scaling as the central modeling assumption. We respond point-by-point below, clarifying its status as an input assumption rather than a derived result, adding a sensitivity analysis, and removing any appearance of circularity. We have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] The power-law form for single-step error E(k) is introduced as the central modeling assumption and is used both to demonstrate error reduction from decomposition and to locate the optimal depth; however, the manuscript provides neither a derivation from the LLM's token-level probabilities nor empirical measurements across varying class cardinalities. This assumption is load-bearing for the claims of a critical threshold and an unsurpassable minimum error.

    Authors: We agree that the power-law E(k) ∝ k^α is a modeling assumption rather than a derivation from token-level probabilities. The manuscript motivates the form by reference to known scaling phenomena in classification error versus class cardinality, but does not derive it from a specific LLM output distribution; such a derivation would require additional assumptions about the model's generative process that lie outside the paper's theoretical scope. We have expanded §3 to state this explicitly and to cite related empirical scaling laws from the literature. A full empirical sweep across class cardinalities is valuable future work but would expand the manuscript beyond its current theoretical focus; we have added a brief discussion of possible experimental tests in the conclusion. revision: partial

  2. Referee: [§4.2] §4.2, the derivation of total error as a function of depth d and the subsequent proof that error cannot decrease beyond the optimum both rely on the strict power-law dependence; if the true scaling is sub-power-law, saturating, or task-dependent, neither the existence of an interior minimum nor the impossibility of further improvement follows. A brief sensitivity analysis or alternative functional forms should be included.

    Authors: We accept this point. The revised manuscript now includes a new sensitivity subsection in §4.2 that examines three alternative forms: logarithmic scaling, linear scaling, and power-law with varying exponents. We show analytically that an interior optimum for depth d* exists whenever the per-step error grows super-linearly with k (including the original power-law regime), while saturating or sub-linear forms yield a minimum at infinite depth. The revised text states the precise conditions under which the critical threshold and finite optimum hold, thereby clarifying the scope of the claims. revision: yes

  3. Referee: [§4.3] The critical threshold on degree k and the optimal depth d* are obtained analytically from the power-law model; because the same functional form is used to establish both the benefit of decomposition and the location of the minimum, the argument risks circularity unless the scaling law is independently justified or validated.

    Authors: We disagree that the argument is circular. The power-law scaling is introduced as an exogenous assumption about single-step classification error; the subsequent derivations of the critical k threshold and the optimal finite depth are logical consequences of that assumption applied to the tree decomposition. This structure is intentional: it isolates the implications of power-law error growth for hierarchical reasoning. To remove any ambiguity we have rewritten the opening of §4.3 to emphasize that the scaling law is an input hypothesis and have added a short paragraph discussing independent empirical routes to validation (e.g., controlled synthetic classification experiments). revision: partial

Circularity Check

0 steps flagged

No significant circularity; power-law scaling serves as independent modeling assumption.

full rationale

The paper states it shows that classification error scales as a power law in the number of classes and then derives consequences for CoT tree decomposition, including a critical degree threshold and an optimal finite depth that minimizes total error. The claim that this minimum cannot be surpassed by further depth increases follows mathematically from minimizing the composite error function E(d, k) under the stated scaling. No quoted step reduces the optimal-depth result or the 'impossible to surpass' statement to a fitted parameter, a self-citation, or a redefinition of the power-law input itself. The scaling is presented as a shown result whose derivation (if present in the full text) is separate from the subsequent minimization; the central claims are therefore consequences rather than tautological restatements of the inputs. This is the normal case of a modeling paper whose conclusions depend on an explicit assumption without self-referential collapse.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The entire analysis rests on the unverified power-law scaling of classification error; no free parameters or new entities are introduced in the abstract, but the scaling itself functions as the load-bearing domain assumption.

axioms (1)
  • domain assumption Classification error scales as a power law in the number of classes
    Invoked to obtain both the error reduction from decomposition and the existence of an optimal tree depth.

pith-pipeline@v0.9.0 · 5691 in / 1147 out tokens · 35686 ms · 2026-05-22T10:12:39.818040+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    ISSN 1095-9203. doi: 10.1126/science.abq1158. Li, Z., Zhong, J., et al. Compressing chain-of-thought in llms via step entropy.arXiv preprint arXiv:2508.03346, 2025b. Lightman, H., Kosaraju, V., et al. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. Liu, A., Feng, B., et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 20...

  2. [2]

    Pointer Sentinel Mixture Models

    ISBN 978-0-521-64298-9. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models.CoRR, abs/1609.07843, 2016. Muennighoff, N., Yang, Z., et al. s1: Simple test-time scaling.arXiv preprint arXiv:2501.19393, 2025. OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2024. Park, S., Zhang, Y., Yu, S. X., Beery, S., and Hu...

  3. [3]

    Target: X5

    doi: 10.5281/zenodo.4414861. Woodworth, R. S. and Sells, S. B. An atmosphere effect in formal syllogistic reasoning.Journal of Experimental Psy- chology, 18(4):451–460, 1935. ISSN 0022-1015(Print). doi: 10.1037/h0060520. Wu, Y., Wang, Y., et al. When more is less: Under- standing chain-of-thought length in llms.arXiv preprint arXiv:2502.07266, 2025. Xie, ...

  4. [4]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work. Only show important steps. Prepend #### to your final answer

  5. [5]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Prepend #### to your final answer

  6. [6]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Check each step to make sure it is correct. Prepend #### to your final answer

  7. [7]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Prepend #### to your final answer

  8. [8]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Then, check each step again to make sure it is correct. Prepend #### to your final answer. Each instruction in this list builds on the previous one. The first level encouraged the model to skip steps i...

  9. [9]

    How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

    Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. She sold 72 clips altogether in April and May. #### 72 \n Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did s...

  10. [10]

    How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

    Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. She sold 48 clips in April. She sold 48+24 = 72 clips altogether in April and May. #### 72 \n Weng earns $12 an hour for babysitting. Yesterday, she just did 50 mi...

  11. [11]

    How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May

    Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. She sold 48 clips in April. She sold 48+24 = 72 clips altogether in April and May. Checking, 48+24=72 is correct. #### 72 \n Weng earns...

  12. [12]

    How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May

    Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May. Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. She sold 48 clips in April. The total will be the sum of the clips sold in April...

  13. [13]

    How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May

    Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? \n I should calculate how many clips Natalia sold in May. Natalia sold 48/2=24 clips in May. Checking, half of 48 is 24. Checking again, 48 divided by 2 is 24. She sold 48 clips in April. The total wil...

  14. [14]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work. Only show important steps. Output the final answer in a box

  15. [15]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Output the final answer in a box

  16. [16]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Check each step to make sure it is correct. Output the final answer in a box

  17. [17]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly check each step to make sure it is correct. Output the final answer in a box

  18. [18]

    Solve the math problem

    You are a helpful assistant. Solve the math problem. Show your work step by step. Explain each step. Explicitly double-check each step to make sure it is correct. Output the final answer in a box. And the few shot examples for these two datasets are listed below in the same order

  19. [19]

    \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

    Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

  20. [20]

    \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

    Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

  21. [21]

    \end{array} \ right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

    Let \[f(x) = \left\{ \ begin{array}{cl} ax+3, & \text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \ right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at...

  22. [22]

    \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper)

    Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

  23. [23]

    overthink

    Let \[f(x) = \left\{ \begin{array}{cl} ax+3, &\text{ if }x>2, \\ x-5 &\text{ if } -2 \le x \le 2, \\ 2x-b &\text{ if } x <-2. \end{array} \right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper). \n For the piecewise function to be continuous, the cases must meet at $2...

  24. [24]

    This is not chain-of-thought

    “This is not chain-of-thought”, “This is not how reasoning works”, “This is not how thinking is implemented in an LLM” In our theory, CoT-based reasoning is characterized by a sequence of predictions made by a model on intermediate classification sub-tasks. The set of all possible sequences of predictions (chains of thought) can be organized into a tree w...

  25. [25]

    (4) as well as prior empirical and theoretical results (Bahri et al., 2024; Kaplan et al., 2020) indicate that error should scale as roughly D−1/d

    What is the optimal degree and depth in real problems? How can you check the degree m of a given task?Our scaling law in Eq. (4) as well as prior empirical and theoretical results (Bahri et al., 2024; Kaplan et al., 2020) indicate that error should scale as roughly D−1/d. By evaluating the test error of a model while varying the size of the dataset D, one...

  26. [26]

    However, it is common to use more sophisticated methods to improve the effectiveness of CoT-based reasoning (Yao et al., 2023; Besta et al., 2024; Wang et al., 2023)

    How would this analysis change if the LLM were to use tree-of-thought, graph-of-thoughts, self-consistency etc.? We have assumed each reasoning step is sampled from the predicted probability distribution over the next token. However, it is common to use more sophisticated methods to improve the effectiveness of CoT-based reasoning (Yao et al., 2023; Besta...