Learning to Think from Multiple Thinkers

Gal Vardi; Nathan Srebro; Nikolaos Tsilivis; Nirmit Joshi; Roey Magen

arxiv: 2604.24737 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.AI· cs.CC· stat.ML

Learning to Think from Multiple Thinkers

Nirmit Joshi , Roey Magen , Nathan Srebro , Nikolaos Tsilivis , Gal Vardi This is my paper

Pith reviewed 2026-05-08 03:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CCstat.ML

keywords chain of thoughtactive learningmultiple teacherssample complexitycomputational learning theorycryptographic assumptionsend-result supervision

0 comments

The pith

Active learning combines CoT from multiple thinkers to achieve high accuracy using constant detailed data per thinker and a slowly growing number of thinkers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper considers concept classes that are easy to learn when given Chain-of-Thought traces from one thinker but hard when given only final answers. It proves that, under cryptographic assumptions, collecting traces passively from even a small number of thinkers remains computationally intractable. The authors then present an active algorithm that queries chosen thinkers on chosen examples. This procedure reaches any target accuracy using an amount of CoT per thinker that stays fixed no matter how small the error, a number of thinkers that grows only as log(1/ε) times log log(1/ε), and a supply of passive final-answer labels that scales as 1/ε times a polylog factor. The result separates the hardness of passive multi-thinker supervision from what active querying can achieve.

Core claim

The paper shows that there exists a computationally efficient active learning algorithm which, given access to multiple thinkers each supplying correct but systematically different step-by-step solutions, learns any target concept using an amount of CoT data per thinker independent of the accuracy parameter ε, Θ(log(1/ε) log log(1/ε)) thinkers in total, and O((1/ε) poly log(1/ε)) passive end-result samples.

What carries the argument

An active querying procedure that selects which thinker to ask for a Chain-of-Thought explanation on which unlabeled example, thereby distributing the detailed supervision across thinkers while relying mostly on cheap passive final-answer data.

If this is right

CoT supervision per thinker can remain bounded even as the target error approaches zero.
The number of distinct thinkers required grows only logarithmically with the inverse of the desired accuracy.
The majority of training data can be ordinary passive labels rather than expensive step-by-step traces.
The same algorithm applies to any class already known to be learnable from one thinker.
Passive collection of multi-thinker CoT is provably intractable for some classes while active collection is tractable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same active strategy could be tested on real language-model thinkers that produce different solution styles for the same math problems.
Diversity across thinkers may substitute for large volumes of data from any single source in other explanation-based learning settings.
The approach suggests a route to combine traces from humans and models without requiring exhaustive annotation from every source.
If the cryptographic hardness for passive settings can be removed, the active algorithm might still offer practical savings in annotation cost.

Load-bearing premise

There exist concept classes that are easy to learn from one thinker's step-by-step traces but hard from final answers alone, and passive learning from a few such thinkers is computationally hard for cryptographic reasons.

What would settle it

An explicit concept class that is learnable from single-thinker CoT yet for which every active algorithm either requires CoT data per thinker that grows with 1/ε or needs more than log(1/ε) log log(1/ε) thinkers to reach accuracy ε.

Figures

Figures reproduced from arXiv: 2604.24737 by Gal Vardi, Nathan Srebro, Nikolaos Tsilivis, Nirmit Joshi, Roey Magen.

**Figure 1.** Figure 1: A visual help on how a linear threshold is applied autoregressively. Tractable Consistent Oracle. In a CoT supervision, the learner additionally receives z (i) = f CoT-T ∗ (x (i) ) for some f∗ ∈ F with f e2e-T ∗ = h∗. Clearly, observing CoTs as additional supervision can only make the learning problem easier. A central message of [RS88, Mal23, JVB+25] is that the availability of CoT from a single f∗ ∈ F ca… view at source ↗

**Figure 2.** Figure 2: Active and Adaptive protocol for actively collecting CoTs from multiple thinkers, while requesting at most M⋆ non-adaptive CoTs from any single one. 6While we were working on other results, the authors of [RHD+26] shared with us their active learning setup and the fact that they are able to show learning with only poly log ε −1 active CoT queries. 10 view at source ↗

**Figure 3.** Figure 3: (Left) The construction of [KS09], which encodes the decryption functions of Regev’s [Reg09] cryptosystem using intersections of degree-2 PTFs. (Right) Our construction, which encodes the same functions using unions of degree-2 PTFs. CoT supervision would imply breaking the cryptosystem. A similar construction is also carried out under Assumption 2, assuming the existence of local PRGs with polynomial stre… view at source ↗

**Figure 4.** Figure 4: Test accuracy of a transformer trained on sequences encoding noiseless parities (d = 100, k ∈ {50, 75, 90}) versus training iterations. Left: Single-thinker supervision (a unique CoT permutation across training samples). Right: Multiple-thinker supervision (each CoT comes from a randomly sampled permutation). We show the mean and standard deviation across three random seeds per configuration. Note the diff… view at source ↗

**Figure 5.** Figure 5: Sample complexity (defined as number of samples required to reach > 95% test accuracy) of a transformer trained on sequences encoding noiseless parities (d = 100) versus support size k. We compare the single thinker to the multiple thinkers case. References [AB02] Martin Anthony and Peter L. Bartlett. Neural Network Learning - Theoretical Foundations. Cambridge University Press, 2002. 5 [ABCM25] Benny App… view at source ↗

read the original abstract

We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot poly\log\frac{1}{\varepsilon}$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's real contribution is a generic active-learning algorithm that keeps CoT supervision per thinker independent of target accuracy while using only logarithmic thinkers and standard passive data scaling.

read the letter

The main thing to know is that this work separates passive hardness from an efficient active solution for multi-thinker CoT. Under crypto assumptions, even two or a few thinkers make passive learning hard for classes that are easy with single-thinker CoT. The active algorithm then uses a fixed small amount of CoT per thinker (no dependence on ε), a number of thinkers that grows as log(1/ε) log log(1/ε), and passive end-result data scaling as 1/ε times polylog factors. That decoupling looks new relative to the single-thinker results they cite from their own prior paper.

Referee Report

0 major / 3 minor

Summary. The paper studies learning with Chain-of-Thought (CoT) supervision from multiple thinkers who provide correct but systematically different solutions. It shows that, under cryptographic assumptions, passive learning from CoT of two or a few thinkers can be hard even for classes that are easy to learn from single-thinker CoT (building on Joshi et al. 2025). It then presents a generic computationally efficient active learning algorithm achieving target accuracy ε with CoT data per thinker independent of ε, a number of thinkers scaling as log(1/ε) log log(1/ε), and passive end-result data scaling as (1/ε) · polylog(1/ε).

Significance. If the algorithmic guarantee and hardness separation hold, the work provides a clean theoretical separation between passive hardness (under crypto assumptions) and active feasibility for multi-thinker CoT learning. The generic active algorithm with ε-independent CoT per thinker and only polylog(1/ε) thinkers is a notable strength, as is the explicit scaling for passive end-result data. This could inform practical strategies for leveraging diverse expert reasoning traces when end-result supervision alone is insufficient.

minor comments (3)

Abstract: the scaling expression 'log 1/ε log log 1/ε' should be parenthesized as log(1/ε) log log(1/ε) for readability; similarly for the passive data bound.
Abstract: the citation 'Joshi et al. 2025' appears to reference a forthcoming or concurrent work; ensure the full manuscript includes a complete reference entry and clarifies the precise hardness result being extended.
Abstract: the phrase 'a small amount of CoT data per thinker that is completely independent of the target accuracy ε' is a strong claim; confirm that the full paper states the exact (constant or polylog) dependence explicitly in the theorem statement.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The referee's description accurately captures our results on the cryptographic hardness of passive multi-thinker CoT learning and the active algorithm with the stated sample complexities.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core positive result is a generic active learning algorithm whose CoT data requirement per thinker is independent of ε, with thinker count and passive data scaling as stated. This construction is presented as new and does not reduce by definition or fitted parameters to the single-thinker classes or passive hardness results cited from Joshi et al. 2025. The cited prior work supplies only the existence of suitable function classes and the cryptographic separation for passive multi-thinker learning; these are external assumptions rather than self-referential inputs that the active algorithm merely renames or re-derives. No load-bearing step in the provided derivation chain equates the claimed bounds to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard cryptographic hardness assumptions for the passive-learning lower bound and on the existence of concept classes with the single-thinker/easy vs. no-CoT/hard property; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Cryptographic assumptions (e.g., one-way functions or similar primitives) that make certain passive learning tasks intractable
Invoked to establish hardness of learning from two or few thinkers in the passive setting.
domain assumption Existence of concept classes that are computationally easy with single-thinker CoT supervision but hard with only end-result labels
Foundation for both the hardness and the algorithmic results.

pith-pipeline@v0.9.0 · 5508 in / 1445 out tokens · 41603 ms · 2026-05-08T03:50:28.893009+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Transformers provably solve parity efficiently with chain of thought.arXiv preprint arXiv:2410.08633, 2024

8, 15, 17, 18, 19, 25, 26, 30, 31, 76, 77, 78 [KS24] Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought.arXiv preprint arXiv:2410.08633, 2024. 9, 20 [KV94] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata.Journal of the ACM (JACM), 41(1):67–95, 1994. ...

work page arXiv 2024
[2]

25 [Tie24] Stefan Tiegel

PMLR, 2023. 25 [Tie24] Stefan Tiegel. Improved hardness results for learning intersections of halfspaces. In The Thirty Seventh Annual Conference on Learning Theory, pages 4764–4786. PMLR,

work page 2023
[3]

How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495, 2025

8, 15, 19, 25, 26, 35 [TMUK25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495, 2025. 9, 20, 59 [Val84] Leslie G Valiant. A theory of the learnable.Communications of the ACM, 27(11):1134– 1142, 1984. 4 [VSP+17] Ashish Vaswani, Noam...

work page arXiv 2025
[4]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

9 [WWS+22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 20 71 A Standard Technical Lemmas Lemma A.1(Bernstein).LetX 1, . . . , Xm be independent real-valued...

work page 2022

[1] [1]

Transformers provably solve parity efficiently with chain of thought.arXiv preprint arXiv:2410.08633, 2024

8, 15, 17, 18, 19, 25, 26, 30, 31, 76, 77, 78 [KS24] Juno Kim and Taiji Suzuki. Transformers provably solve parity efficiently with chain of thought.arXiv preprint arXiv:2410.08633, 2024. 9, 20 [KV94] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata.Journal of the ACM (JACM), 41(1):67–95, 1994. ...

work page arXiv 2024

[2] [2]

25 [Tie24] Stefan Tiegel

PMLR, 2023. 25 [Tie24] Stefan Tiegel. Improved hardness results for learning intersections of halfspaces. In The Thirty Seventh Annual Conference on Learning Theory, pages 4764–4786. PMLR,

work page 2023

[3] [3]

How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495, 2025

8, 15, 19, 25, 26, 35 [TMUK25] Nikolaos Tsilivis, Eran Malach, Karen Ullrich, and Julia Kempe. How rein- forcement learning after next-token prediction facilitates learning.arXiv preprint arXiv:2510.11495, 2025. 9, 20, 59 [Val84] Leslie G Valiant. A theory of the learnable.Communications of the ACM, 27(11):1134– 1142, 1984. 4 [VSP+17] Ashish Vaswani, Noam...

work page arXiv 2025

[4] [4]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

9 [WWS+22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 20 71 A Standard Technical Lemmas Lemma A.1(Bernstein).LetX 1, . . . , Xm be independent real-valued...

work page 2022