Learning to Think from Multiple Thinkers
Pith reviewed 2026-05-08 03:50 UTC · model grok-4.3
The pith
Active learning combines CoT from multiple thinkers to achieve high accuracy using constant detailed data per thinker and a slowly growing number of thinkers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that there exists a computationally efficient active learning algorithm which, given access to multiple thinkers each supplying correct but systematically different step-by-step solutions, learns any target concept using an amount of CoT data per thinker independent of the accuracy parameter ε, Θ(log(1/ε) log log(1/ε)) thinkers in total, and O((1/ε) poly log(1/ε)) passive end-result samples.
What carries the argument
An active querying procedure that selects which thinker to ask for a Chain-of-Thought explanation on which unlabeled example, thereby distributing the detailed supervision across thinkers while relying mostly on cheap passive final-answer data.
If this is right
- CoT supervision per thinker can remain bounded even as the target error approaches zero.
- The number of distinct thinkers required grows only logarithmically with the inverse of the desired accuracy.
- The majority of training data can be ordinary passive labels rather than expensive step-by-step traces.
- The same algorithm applies to any class already known to be learnable from one thinker.
- Passive collection of multi-thinker CoT is provably intractable for some classes while active collection is tractable.
Where Pith is reading between the lines
- The same active strategy could be tested on real language-model thinkers that produce different solution styles for the same math problems.
- Diversity across thinkers may substitute for large volumes of data from any single source in other explanation-based learning settings.
- The approach suggests a route to combine traces from humans and models without requiring exhaustive annotation from every source.
- If the cryptographic hardness for passive settings can be removed, the active algorithm might still offer practical savings in annotation cost.
Load-bearing premise
There exist concept classes that are easy to learn from one thinker's step-by-step traces but hard from final answers alone, and passive learning from a few such thinkers is computationally hard for cryptographic reasons.
What would settle it
An explicit concept class that is learnable from single-thinker CoT yet for which every active algorithm either requires CoT data per thinker that grows with 1/ε or needs more than log(1/ε) log log(1/ε) thinkers to reach accuracy ε.
Figures
read the original abstract
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot poly\log\frac{1}{\varepsilon}$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies learning with Chain-of-Thought (CoT) supervision from multiple thinkers who provide correct but systematically different solutions. It shows that, under cryptographic assumptions, passive learning from CoT of two or a few thinkers can be hard even for classes that are easy to learn from single-thinker CoT (building on Joshi et al. 2025). It then presents a generic computationally efficient active learning algorithm achieving target accuracy ε with CoT data per thinker independent of ε, a number of thinkers scaling as log(1/ε) log log(1/ε), and passive end-result data scaling as (1/ε) · polylog(1/ε).
Significance. If the algorithmic guarantee and hardness separation hold, the work provides a clean theoretical separation between passive hardness (under crypto assumptions) and active feasibility for multi-thinker CoT learning. The generic active algorithm with ε-independent CoT per thinker and only polylog(1/ε) thinkers is a notable strength, as is the explicit scaling for passive end-result data. This could inform practical strategies for leveraging diverse expert reasoning traces when end-result supervision alone is insufficient.
minor comments (3)
- Abstract: the scaling expression 'log 1/ε log log 1/ε' should be parenthesized as log(1/ε) log log(1/ε) for readability; similarly for the passive data bound.
- Abstract: the citation 'Joshi et al. 2025' appears to reference a forthcoming or concurrent work; ensure the full manuscript includes a complete reference entry and clarifies the precise hardness result being extended.
- Abstract: the phrase 'a small amount of CoT data per thinker that is completely independent of the target accuracy ε' is a strong claim; confirm that the full paper states the exact (constant or polylog) dependence explicitly in the theorem statement.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The referee's description accurately captures our results on the cryptographic hardness of passive multi-thinker CoT learning and the active algorithm with the stated sample complexities.
Circularity Check
No significant circularity detected
full rationale
The paper's core positive result is a generic active learning algorithm whose CoT data requirement per thinker is independent of ε, with thinker count and passive data scaling as stated. This construction is presented as new and does not reduce by definition or fitted parameters to the single-thinker classes or passive hardness results cited from Joshi et al. 2025. The cited prior work supplies only the existence of suitable function classes and the cryptographic separation for passive multi-thinker learning; these are external assumptions rather than self-referential inputs that the active algorithm merely renames or re-derives. No load-bearing step in the provided derivation chain equates the claimed bounds to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Cryptographic assumptions (e.g., one-way functions or similar primitives) that make certain passive learning tasks intractable
- domain assumption Existence of concept classes that are computationally easy with single-thinker CoT supervision but hard with only end-result labels
Reference graph
Works this paper leans on
-
[1]
8, 15, 17, 18, 19, 25, 26, 30, 31, 76, 77, 78 [KS24] Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought.arXiv preprint arXiv:2410.08633, 2024. 9, 20 [KV94] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata.Journal of the ACM (JACM), 41(1):67–95, 1994. ...
-
[2]
PMLR, 2023. 25 [Tie24] Stefan Tiegel. Improved hardness results for learning intersections of halfspaces. In The Thirty Seventh Annual Conference on Learning Theory, pages 4764–4786. PMLR,
work page 2023
-
[3]
8, 15, 19, 25, 26, 35 [TMUK25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495, 2025. 9, 20, 59 [Val84] Leslie G Valiant. A theory of the learnable.Communications of the ACM, 27(11):1134– 1142, 1984. 4 [VSP+17] Ashish Vaswani, Noam...
-
[4]
9 [WWS+22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 20 71 A Standard Technical Lemmas Lemma A.1(Bernstein).LetX 1, . . . , Xm be independent real-valued...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.