On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

Tommaso Cesari; Yongyi Mao; Yue Zhang; Zhiyi Dong

arxiv: 2605.21260 · v1 · pith:SYO4IKLPnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 05:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords: chain of thought, reasoning risk, oracle-trajectory risk, trajectory-mismatch risk, stability, error accumulation, learning theory, domain adaptation

The pith

Chain-of-thought reasoning risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models chain-of-thought as the interaction between an answer map and an autoregressive chain rule that produces intermediate questions. It defines the reasoning risk of a hypothesis under this interaction and decomposes the risk into an oracle-trajectory risk that captures the benefit of CoT by reducing to a target-domain risk in domain adaptation, and a trajectory-mismatch risk that captures the cost through error accumulation on mismatched paths. The work shows that without stability in the loss, the hypothesis answer map, or the chain rule, the mismatch cost can become arbitrarily large even when the oracle term is zero and the hypothesis is uniformly close to the ground truth. Under stability, a tight upper bound on the mismatch risk is controlled by an exact amplification factor that distinguishes bounded, linear, and exponential error-growth regimes. This supplies a precise account of when chain-of-thought improves accuracy and when it degrades it.
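
The three error-growth regimes named above can be illustrated with a standard error-propagation recursion. The sketch below is an assumption made for illustration and is not the paper's exact amplification factor: it supposes that each reasoning step adds one unit of error and multiplies the error already present by a hypothetical composite stability constant L, so that after K steps the accumulated error is the geometric sum 1 + L + ... + L^(K-1). The names amplification, regime, L, and K are introduced here only for this example.

```python
def amplification(L: float, K: int) -> float:
    """Return sum_{j=0}^{K-1} L**j.

    Illustrative stand-in for an amplification factor: the accumulated
    error after K steps when each step injects one unit of error and
    scales the existing error by L (a hypothetical stability constant).
    """
    total, power = 0.0, 1.0
    for _ in range(K):
        total += power   # error injected at this step, after scaling
        power *= L       # scaling applied by one more step
    return total


def regime(L: float) -> str:
    """Classify the growth of amplification(L, K) as K increases."""
    if L < 1.0:
        return "bounded"      # converges to 1 / (1 - L)
    if L == 1.0:
        return "linear"       # equals K
    return "exponential"      # grows like L**K


if __name__ == "__main__":
    for L in (0.5, 1.0, 2.0):
        values = [amplification(L, K) for K in (1, 5, 10)]
        print(L, regime(L), values)
```

Under this toy recursion the transition between regimes happens exactly at L = 1; whether the paper's factor has the same threshold structure should be checked against its definitions.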

Core claim

We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. Under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes.

What carries the argument

The canonical decomposition of reasoning risk into oracle-trajectory risk (OTR) and trajectory-mismatch risk (TMR), which separates CoT's benefit from its cost of error accumulation on mismatched trajectories.

Load-bearing premise

The modeling of CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively.

What would settle it

Construct an unstable chain rule and an accurate hypothesis with zero oracle-trajectory risk, then measure whether the trajectory-mismatch risk grows without bound.
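
A minimal sketch of such an experiment follows, under loudly stated assumptions. The answer map f, the hypothesis h, and the scalar chain rule with gain c are all hypothetical toy choices, not the paper's constructions, and the sketch does not enforce zero oracle-trajectory risk. It only compares the final answer reached along the oracle trajectory (intermediate questions built from ground-truth answers) with the one reached along the self-generated trajectory (intermediate questions built from the hypothesis's own answers).

```python
def f(q: float) -> float:
    """Toy ground-truth answer map (assumed): the answer equals the question."""
    return q


def h(q: float, eps: float) -> float:
    """Toy hypothesis (assumed): uniformly eps-close to the ground truth f."""
    return f(q) + eps


def final_answer_gap(c: float, eps: float, q0: float, K: int) -> float:
    """Gap between the final answers on the two trajectories after K chain steps.

    The toy chain rule maps the latest answer a to the next question c * a,
    so c acts as the chain rule's sensitivity to errors in the answer.
    """
    q_oracle = q0  # next questions generated from ground-truth answers
    q_self = q0    # next questions generated from the hypothesis's answers
    for _ in range(K):
        q_oracle = c * f(q_oracle)
        q_self = c * h(q_self, eps)
    return abs(h(q_self, eps) - f(q_oracle))


if __name__ == "__main__":
    eps = 2.0 ** -10
    for c in (0.5, 1.0, 2.0):
        gaps = [final_answer_gap(c, eps, 1.0, K) for K in (1, 5, 10, 20)]
        print(f"c={c}: {gaps}")
```

In this toy setup the gap equals eps times (1 + c + ... + c^K), so a contractive chain rule (c < 1) keeps it below 2 * eps when c = 0.5, while an expansive one (c > 1) lets it grow geometrically with the chain length even though h never deviates from f by more than eps.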

Original abstract

We develop a learning-theoretic framework for understanding Chain of Thought (CoT). We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. We then show that this cost is unavoidable without structure: if any one of the loss, the hypothesis answer map, or the chain rule lacks stability, the TMR can be arbitrarily large even when the OTR is zero and the hypothesis is uniformly close to the ground truth. Conversely, under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes. Together, these results give a precise theory of when CoT helps, when it hurts, and what controls the transition between the two.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper cleanly decomposes CoT risk into an oracle benefit term and a mismatch cost term, with stability setting the error growth rate, but the split rests on treating the chain rule as separable from the answer map.

The core contribution is a canonical split of reasoning risk under CoT into oracle-trajectory risk, which reduces to a target-domain risk and captures the benefit, and trajectory-mismatch risk, which tracks error accumulation along wrong intermediate paths. Under a stability assumption, the authors derive a tight upper bound on the mismatch term, controlled by an exact amplification factor that distinguishes bounded, linear, and exponential regimes. They also show that without stability the mismatch term can be made arbitrarily large even when the oracle term is zero and the hypothesis is uniformly close to the ground truth. The result is a precise account of when CoT helps and when it hurts, rather than another empirical observation or loose generalization bound.

Referee Report

1 major / 1 minor

Summary. The paper develops a learning-theoretic framework for understanding Chain of Thought (CoT). It models CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and defines the reasoning risk of a hypothesis under this interaction. The main results are a tight canonical decomposition of this risk into oracle-trajectory risk (OTR) capturing the benefit of CoT (reducing to target-domain risk in domain adaptation) and trajectory-mismatch risk (TMR) capturing the cost through error accumulation. It shows that without stability, TMR can be arbitrarily large even when OTR is zero, and under stability provides a tight upper bound governed by an amplification factor identifying bounded, linear, and exponential error-growth regimes.

Significance. This framework offers a precise theory of when CoT helps or hurts by identifying stability as the key factor. The OTR/TMR decomposition provides clear separation of benefits and costs, with the domain adaptation analogy adding interpretability. The error growth regimes could help predict and mitigate issues in long reasoning chains. These results, if verified, contribute to the theoretical foundations of reasoning in large models.

major comments (1)

[Modeling of CoT and definition of reasoning risk] The canonical decomposition into OTR and TMR is a direct algebraic consequence of the modeling choice where CoT is the interaction between a fixed answer map and an independent autoregressive chain rule. This modeling enables the split but may not reflect the joint training dynamics typical in CoT, where a single model generates both the chain and the answer without explicit separation. Since this is the load-bearing assumption for defining the reasoning risk and enabling the subsequent analysis, the paper would benefit from additional discussion on how the results translate to jointly optimized models or why the separation is a reasonable abstraction.

minor comments (1)

[Notation and definitions] Ensure that all symbols, such as the amplification factor, are clearly defined upon first use and that any assumptions on the hypothesis class are explicitly stated to aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comments, as well as for recognizing the potential of the OTR/TMR framework. We address the single major comment below and will revise the manuscript accordingly to strengthen the discussion of modeling assumptions.

Referee: [Modeling of CoT and definition of reasoning risk] The canonical decomposition into OTR and TMR is a direct algebraic consequence of the modeling choice where CoT is the interaction between a fixed answer map and an independent autoregressive chain rule. This modeling enables the split but may not reflect the joint training dynamics typical in CoT, where a single model generates both the chain and the answer without explicit separation. Since this is the load-bearing assumption for defining the reasoning risk and enabling the subsequent analysis, the paper would benefit from additional discussion on how the results translate to jointly optimized models or why the separation is a reasonable abstraction.

Authors: We agree that the separation of the answer map and chain rule is a deliberate modeling abstraction chosen to enable the clean algebraic decomposition of reasoning risk. This choice is reasonable because the autoregressive generation of intermediate steps followed by a final answer map is the functional form of CoT even when a single model is trained end-to-end; the parameters may be shared, but the roles remain distinct and the risk decomposition continues to hold formally under the same interaction. The framework thereby isolates the benefit (OTR, which reduces to target risk in a domain-adaptation view) from the cost (TMR due to trajectory mismatch), providing insight that remains relevant for jointly optimized models. To address the referee's suggestion, we will add a new paragraph in the Discussion section explaining this rationale, noting that the stability conditions and error-growth regimes apply directly to the composed hypothesis regardless of training procedure, and briefly relating the abstraction to modular versus monolithic reasoning architectures in the literature. This revision clarifies scope without changing any theorems or proofs.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper defines a reasoning risk via the composition of an answer map and an autoregressive chain rule for generating intermediate questions. It then algebraically decomposes this defined quantity into an oracle-trajectory risk (OTR) term and a trajectory-mismatch risk (TMR) term. Subsequent stability-based bounds on TMR follow from additional assumptions on the loss, hypothesis, and chain rule rather than from any fitted parameters, self-citations, or reductions of the target claims to the inputs by construction. The framework remains self-contained as a sequence of definitions, identities, and conditional theorems without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the newly introduced definitions of reasoning risk, OTR, and TMR plus stability assumptions on loss, answer map, and chain rule. These are defined within the paper rather than taken from upstream literature without justification. No numerical free parameters or new physical entities are mentioned.

axioms (1)

[standard math] Existence of probability distributions over input sequences and output labels for defining expectations in the risk terms. Invoked when defining reasoning risk and its decomposition into OTR and TMR.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited theorem: unclear
Paper passage: We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk... tight canonical decomposition... trajectory-mismatch risk (TMR)... oracle-trajectory risk (OTR)... amplification factor α_K(ϕ, δ)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration
Relation between the paper passage and the cited theorem: unclear
Paper passage: under stability... exact amplification factor that identifies bounded, linear, and exponential error-growth regimes

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

123 extracted references · 123 canonical work pages · 12 internal anchors

[1]

A. A. Abdullah, A. Zubiaga, S. Mirjalili, A. H. Gandomi, F. Daneshfar, M. Amini, A. S. Mohammed, and H. Veisi. Evolution of meta’s llama models and parameter-efficient fine-tuning of large language models: a survey.arXiv preprint arXiv:2510.12178, 2025

work page arXiv 2025
[2]

Acuna, G

D. Acuna, G. Zhang, M. T. Law, and S. Fidler. f-domain adversarial learning: Theory and algorithms. In M. Meila and T. Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 66–75. PMLR, 18–24 Jul 2021

work page 2021
[3]

Aghajohari, K

M. Aghajohari, K. Chitsaz, A. Kazemnejad, S. Chandar, A. Sordoni, A. Courville, and S. Reddy. The markovian thinker: Architecture-agnostic linear scaling of reasoning. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[4]

Altabaa, O

A. Altabaa, O. Montasser, and J. Lafferty. Cot information: Improved sample complexity under chain- of-thought supervision. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Information Processing Systems, volume 38, pages 24822–24862. Curran Associates, Inc., 2025

work page 2025
[5]

Amiri, X

A. Amiri, X. Huang, M. Rofin, and M. Hahn. Lower bounds for chain-of-thought reasoning in hard- attention transformers. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Proceedings of the 42nd International Conference on Machine Learn- ing, volume 267 ofProceedings of Machine Learning Research, ...

work page 2025
[6]

Claude Opus 4.6 System Card

Anthropic. Claude Opus 4.6 System Card. System card, Anthropic, Feb. 2026. URLhttps://www-cdn. anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf. Accessed: 2026-05-06

work page 2026
[7]

Bachmann and V

G. Bachmann and V. Nagarajan. The pitfalls of next-token prediction. InForty-first International Conference on Machine Learning, 2024

work page 2024
[8]

X. Bai, I. Pres, Y. Deng, C. Tan, S. Shieber, F. Vi´ egas, M. Wattenberg, and A. Lee. Why can’t transformers learn multiplication? reverse-engineering reveals long-range dependency pitfalls.arXiv preprint arXiv:2510.00184, 2025

work page arXiv 2025
[9]

G. Bao, H. Zhang, C. Wang, L. Yang, and Y. Zhang. How likely do LLMs with CoT mimic human reasoning? In O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, and S. Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 7831– 7850, Abu Dhabi, UAE, Jan. 2025. Association for Computational Linguistics

work page 2025
[10]

Barcelo, A

P. Barcelo, A. Kozachinskiy, and T. Steifer. Ehrenfeucht-haussler rank and chain of thought. In Forty-second International Conference on Machine Learning, 2025

work page 2025
[11]

Ben-David and R

S. Ben-David and R. Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. InInternational Conference on Algorithmic Learning Theory, pages 139–153. Springer, 2012

work page 2012
[12]

Ben-David and R

S. Ben-David and R. Urner. Domain adaptation–can quantity compensate for quality?Annals of Mathematics and Artificial Intelligence, 70(3):185–202, 2014

work page 2014
[13]

Ben-David, J

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adapta- tion. In B. Sch¨ olkopf, J. Platt, and T. Hoffman, editors,Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006

work page 2006
[14]

Ben-David, J

S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains.Machine learning, 79(1):151–175, 2010. 11

work page 2010
[15]

Besta, N

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

work page 2024
[16]

Besta, F

M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, G. Piao, N. Blach, P. Nyczyk, M. Copik, G. Kwa´ sniewski, J. M¨ uller, et al. Demystifying chains, trees, and graphs of thoughts.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[17]

Blitzer, K

J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adap- tation. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors,Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007

work page 2007
[18]

Brown, B

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei....

work page 1901
[19]

L. Chen, B. Peng, and H. Wu. Theoretical limitations of multi-layer transformer. In2025 IEEE 66th Annual Symposium on Foundations of Computer Science (FOCS), pages 2631–2653, 2025. doi: 10.1109/FOCS63196.2025.00136

work page doi:10.1109/focs63196.2025.00136 2025
[20]

Q. Chen, L. Qin, J. Wang, J. Zhou, and W. Che. Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Process- ing Systems, volume 37, pages 54872–54904. Curran Associates,...

work page doi:10.52202/079017-1740 2024
[21]

Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567v5, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

X. Chen, R. Aksitov, U. Alon, J. Ren, K. Xiao, P. Yin, S. Prakash, C. Sutton, X. Wang, and D. Zhou. Universal self-consistency for large language models. InICML 2024 Workshop on In-Context Learning, 2024

work page 2024
[23]

X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, R. Wang, Z. Tu, H. Mi, and D. Yu. Do NOT think that much for 2+3=? On the overthinking of long reasoning models. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Proceedings of the 42nd International Conference ...

work page 2025
[24]

Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, C. Denison, J. Schulman, A. Somani, P. Hase, M. Wagner, F. Roger, et al. Reasoning models don’t always say what they think.arXiv preprint arXiv:2505.05410, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

J. Cheng and B. V. Durme. Compressed chain of thought: Efficient reasoning through dense represen- tations.CoRR, abs/2412.13171, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Cheng, X

Y. Cheng, X. Liang, Y. Gong, W. Xiao, S. Wang, Y. Zhang, W. Hou, K. Xu, W. Liu, W. Li, J. Jiao, Q. Chen, P. CHENG, and W. Xiong. Integrative decoding: Improving factuality via implicit self- consistency. InThe Thirteenth International Conference on Learning Representations, 2025. 12

work page 2025
[27]

Y. Cui, P. He, X. Tang, Q. He, C. Luo, J. Tang, and Y. Xing. A theoretical understanding of chain- of-thought: Coherent reasoning and error-aware demonstration. In Y. Li, S. Mandt, S. Agrawal, and E. Khan, editors,Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 ofProceedings of Machine Learning Resear...

work page 2025
[28]

S. B. David, T. Lu, T. Luu, and D. Pal. Impossibility theorems for domain adaptation. In Y. W. Teh and M. Titterington, editors,Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 ofProceedings of Machine Learning Research, pages 129–136, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR

work page 2010
[29]

Z. Dong, Z. Liu, and Y. Mao. On the hardness of unsupervised domain adaptation: Optimal learners and information-theoretic perspective. In S. Chandar, R. Pascanu, E. Eaton, B. Liu, R. Mahmood, and A. Rannen-Triki, editors,Proceedings of The 4th Conference on Lifelong Learning Agents, volume 330 ofProceedings of Machine Learning Research, pages 89–111. PML...

work page 2025
[30]

G. Feng, B. Zhang, Y. Gu, H. Ye, D. He, and L. Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 70757–70798. Curran Associates, Inc., 2023

work page 2023
[31]

Gambardella, Y

A. Gambardella, Y. Iwasawa, and Y. Matsuo. Language models do hard arithmetic tasks easily and hardly do easy arithmetic tasks. In L.-W. Ku, A. Martins, and V. Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 85–91, Bangkok, Thailand, Aug. 2024. Association for Comput...

work page 2024
[32]

Z. Gan, Y. Liao, and Y. Liu. Rethinking external slow-thinking: From snowball errors to probability of correct reasoning. InForty-second International Conference on Machine Learning, 2025

work page 2025
[33]

Z. Gan, R. Ren, W. Yao, X. Hu, G. Xu, C. Qian, H. Tang, Z. Gong, X. Yao, P. Tang, et al. Beyond the black box: Theory and mechanism of large language models.arXiv preprint arXiv:2601.02907, 2026

work page arXiv 2026
[34]

Ganin, E

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lem- pitsky. Domain-adversarial training of neural networks.Journal of machine learning research, 17(59): 1–35, 2016

work page 2016
[35]

Geiping, S

J. Geiping, S. M. McLeish, N. Jain, J. Kirchenbauer, S. Singh, B. R. Bartoldson, B. Kailkhura, A. Bhatele, and T. Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[36]

H. A. Gozeten, M. E. Ildiz, X. Zhang, H. Harutyunyan, A. S. Rawat, and S. Oymak. Continuous chain of thought enables parallel exploration and reasoning. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[37]

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

M. Hahn. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020. doi: 10.1162/tacl a 00306

work page internal anchor Pith review doi:10.1162/tacl 2020
[39]

Hanneke and S

S. Hanneke and S. Kpotufe. On the value of target data in transfer learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors,Advances in Neu- ral Information Processing Systems, volume 32. Curran Associates, Inc., 2019. 13

work page 2019
[40]

S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. E. Weston, and Y. Tian. Training large language models to reason in a continuous latent space. InSecond Conference on Language Modeling, 2025

work page 2025
[41]

X. Hu, F. Zhang, S. Chen, and Z. Yang. Unveiling the statistical foundations of chain-of-thought prompting methods.CoRR, abs/2408.14511, 2024

work page arXiv 2024
[42]

Huang, Z

J. Huang, Z. Wang, and J. D. Lee. Transformers learn to implement multi-step gradient descent with chain of thought. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[43]

Huang, Z

Y. Huang, Z. Wen, A. Singh, Y. Chi, and Y. Chen. Transformers provably learn chain-of-thought reasoning with length generalization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[44]

Jiang and C

J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In A. Zaenen and A. van den Bosch, editors,Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 264–271, Prague, Czech Republic, June 2007. Association for Computational Linguistics

work page 2007
[45]

Joshi, G

N. Joshi, G. Vardi, A. Block, S. Goel, Z. Li, T. Misiakiewicz, and N. Srebro. A theory of learning with autoregressive chain of thought. In N. Haghtalab and A. Moitra, editors,Proceedings of Thirty Eighth Conference on Learning Theory, volume 291 ofProceedings of Machine Learning Research, pages 3161–3212. PMLR, 30 Jun–04 Jul 2025

work page 2025
[46]

Kim and T

J. Kim and T. Suzuki. Transformers provably solve parity efficiently with chain of thought. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[47]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors,Advances in Neural Information Processing Systems, 2022

work page 2022
[48]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

T. Korbak, M. Balesni, E. Barnes, Y. Bengio, J. Benton, J. Bloom, M. Chen, A. Cooney, A. Dafoe, A. D. Dragan, S. Emmons, O. Evans, D. Farhi, R. Greenblatt, D. Hendrycks, M. Hobbhahn, E. Hub- inger, G. Irving, E. Jenner, D. Kokotajlo, V. Krakovna, S. Legg, D. Lindner, D. Luan, A. Madry, J. Michael, N. Nanda, D. Orr, J. Pachocki, E. Perez, M. Phuong, F. Rog...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[49]

Kruttschnitt, J

G. Kruttschnitt, J. Shim, A. Ma, D. Kim, B. Chek, A. Anand, K. Zhu, and S. O’Brien. Contrastive chain-of-thought prompting.CoRR, abs/2407.03600, 2024

work page arXiv 2024
[50]

A. Lee, E. Che, and T. Peng. How well do LLMs compress their own chain-of-thought? a token complexity approach. InES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025

work page 2025
[51]

H. Li, S. Lu, P.-Y. Chen, X. Cui, and M. Wang. Training nonlinear transformers for chain-of-thought inference: A theoretical generalization analysis. InThe Thirteenth International Conference on Learn- ing Representations, 2025

work page 2025
[52]

J. Li, Y. Fu, L. Fan, J. Liu, Y. Shu, C. Qin, M. Yang, I. King, and R. Ying. Implicit reasoning in large language models: A comprehensive survey.arXiv preprint arXiv:2509.02350, 2025

work page arXiv 2025
[53]

Z. Li, H. Liu, D. Zhou, and T. Ma. Chain of thought empowers transformers to solve inherently serial problems. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[54]

T. Liu, Q. Guo, X. Hu, C. Jiayang, Y. Zhang, X. Qiu, and Z. Zhang. Can language models learn to skip steps? In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 45359–45385. Curran Associates, Inc., 2024. doi: 10.52202/079017-1441. 14

work page doi:10.52202/079017-1441 2024
[55]

T. Liu, W. Xu, W. Huang, Y. Zeng, J. Wang, X. Wang, H. Yang, and J. Li. Logic-of-thought: Injecting logic into contexts for full reasoning in large language models. In L. Chiruzzo, A. Ritter, and L. Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Techn...

work page doi:10.18653/v1/2025.naacl-long.510 2025
[56]

M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In D. Precup and Y. W. Teh, editors,Proceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 2208–2217. PMLR, 06–11 Aug 2017

work page 2017
[57]

X. Ma, G. Wan, R. Yu, G. Fang, and X. Wang. CoT-valve: Length-compressible chain-of-thought tuning. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6025–6035, Vienna, Austria, July 2025. Association for Computational Lingui...

work page doi:10.18653/v1/2025.acl-long.300 2025
[58]

Madaan, K

A. Madaan, K. Hermann, and A. Yazdanbakhsh. What makes chain-of-thought prompting effective? a counterfactual study. In H. Bouamor, J. Pino, and K. Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1448–1535, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.101

work page doi:10.18653/v1/2023.findings-emnlp.101 2023
[59]

E. Malach. Auto-regressive next-token predictors are universal learners. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Proceedings of the 41st Inter- national Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 34417–34431. PMLR, 21–27 Jul 2024

work page 2024
[60]

Malon and X

C. Malon and X. Zhu. Self-consistent decoding for more factual open responses.ArXiv, abs/2403.00696, 2024

work page arXiv 2024
[61]

Mansour, M

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. InProceedings of The 22nd Annual Conference on Learning Theory (COLT 2009), Montr´ eal, Canada, 2009

work page 2009
[62]

Merrill and A

W. Merrill and A. Sabharwal. The expressive power of transformers with chain of thought. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[63]

S. I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. GSM-symbolic: Understanding the limitations of mathematical reasoning in large language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[64]

Mondorf and B

P. Mondorf and B. Plank. Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey. InFirst Conference on Language Modeling, 2024

work page 2024
[65]

Y. Ning, W. Li, J. Fang, N. Tan, and H. Liu. Not all thoughts are generated equal: Efficient llm reasoning via multi-turn reinforcement learning.arXiv preprint arXiv:2505.11827, 2025

[66]

M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021

[67]

Q. Pan, W. Ji, Y. Ding, J. Li, S. Chen, J. Wang, J. Zhou, Q. Chen, M. Zhang, Y. Wu, et al. A survey of slow thinking-based reasoning llms using reinforced learning and inference-time scaling law. arXiv preprint arXiv:2505.02665, 2025

[68]

B. Peng, S. Narayanan, and C. Papadimitriou. On limitations of the transformer architecture. In First Conference on Language Modeling, 2024

[69]

J. Pérez, P. Barceló, and J. Marinkovic. Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1–35, 2021

[70]

B. Prystawski, M. Li, and N. Goodman. Why think step by step? reasoning emerges from the locality of experience. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 70926–70947. Curran Associates, Inc., 2023

[71]

C. Qian, D. Liu, H. Wen, Z. Bai, Y. Liu, and J. Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in LLM reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[72]

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving Language Understanding by Generative Pre-Training. Technical report, OpenAI, 2018

[73]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI, 2019

[74]

I. Redko, A. Habrard, and M. Sebban. On the analysis of adaptability in multi-source domain adaptation. Machine Learning, 108(8):1635–1652, 2019

[75]

B. Roark and M. Bacchiani. Supervised and unsupervised PCFG adaptation to novel domains. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 205–212, 2003

[76]

K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018. doi: 10.1109/CVPR.2018.00392

[77]

C. Sanford, D. Hsu, and M. Telgarsky. Representational strengths and limitations of transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[78]

N. Saunshi, N. Dikkala, Z. Li, S. Kumar, and S. J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, 2025

[79]

J. Shen, Y. Qu, W. Zhang, and Y. Yu. Wasserstein distance guided representation learning for domain adaptation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAA...

[80]

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

