arxiv: 2606.21884 · v1 · pith:AVQAUMOTnew · submitted 2026-06-20 · 💻 cs.LG · cs.AI · cs.CL

A Verifiable Search Is Not a Learnable Chain-of-Thought

Pith reviewed 2026-06-26 12:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords chain-of-thought · distillation · cryptarithm · verifiable search · backtracking · LoRA · memorization · verification

The pith

Search over information-free structure cannot be imitated as forward chain-of-thought.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether models can learn chain-of-thought for tasks that require search. On forward-computable tasks such as lookup and arithmetic, distillation works well. On cryptarithm, which needs backtracking search, accuracy stays at 0.01-0.07 across every training regime tried, even though the model gets the individual steps right. Revealing the cipher key turns the derivation forward and lifts the same instances from 0.03 to 0.57. The paper concludes that such search procedures resist being learned as left-to-right reasoning and become learnable only once the search results are precomputed into a lookup.

Core claim

When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.

What carries the argument

A controlled key-revealing intervention: supplying the cipher key upfront converts the backtracking search into a forward derivation, and accuracy on the same instances rises from 0.03 to 0.57.
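
To make the contrast concrete, here is a minimal Python sketch on a classic word-addition cryptarithm (TO + GO = OUT). It is an illustration under stated assumptions, not the paper's setup: the paper's cryptarithm generator is not reproduced, brute-force enumeration stands in for the backtracking solver, and the names `check_with_key` and `solve_by_search` are hypothetical. With the key in hand, the instance is decided by one pass of arithmetic; without it, candidate keys have to be tried until one passes.

```python
from itertools import permutations

def word_value(word, key):
    """Read a word as a base-10 number under a letter -> digit key."""
    return int("".join(str(key[ch]) for ch in word))

def check_with_key(addends, total, key):
    """Forward derivation: given the key, one pass of arithmetic decides the instance."""
    return sum(word_value(w, key) for w in addends) == word_value(total, key)

def solve_by_search(addends, total):
    """Search: enumerate candidate keys until one passes the forward check.

    Brute-force enumeration is used here as a simple stand-in for backtracking.
    Returns (key, number_of_candidates_tried).
    """
    words = addends + [total]
    letters = sorted(set("".join(words)))
    leading = {w[0] for w in words}
    tried = 0
    for digits in permutations(range(10), len(letters)):
        key = dict(zip(letters, digits))
        if any(key[ch] == 0 for ch in leading):
            continue  # numbers may not start with zero
        tried += 1
        if check_with_key(addends, total, key):
            return key, tried
    return None, tried

key, tried = solve_by_search(["TO", "GO"], "OUT")
print(key, "found after", tried, "candidates")
print(check_with_key(["TO", "GO"], "OUT", key))  # the key-revealed path: a single check
```

The asymmetry in the sketch is the point: `check_with_key` is a fixed left-to-right computation, while `solve_by_search` has no intermediate line that tells it which candidate to try next.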

If this is right

  • Forward-computable tasks install readily: lookup/arithmetic transfers at ≥ 0.99 and the 8-bit boolean task at 0.68.
  • Cryptarithm distillation holds at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, despite a search solver answering 71% of instances.
  • Models perform arithmetic correctly on 97-100% of lines and rank the correct cipher in the top eight 71% of the time but cannot carry the search forward.
  • Fine-tuning learns the shape of a verifiable elimination step while verdicts become unconditional templates correct only 16-57% of the time.
  • Precomputing the combinatorial core into a catalog reduces the trace to recall plus verification and reaches 0.92 on the private leaderboard.
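
The catalog idea in the last bullet can be sketched as follows. This is a hedged toy version, not the 1st-place solution: the signature used here (letters renamed by first appearance) and all function names are assumptions made for illustration, and the puzzles are classic word-addition cryptarithms rather than the competition's instances. The expensive search runs once per signature offline; at answer time only a lookup and a forward check remain.

```python
from itertools import permutations

def canonical(words):
    """Rename letters by first appearance so isomorphic puzzles share one signature."""
    mapping = {}
    for ch in "".join(words):
        mapping.setdefault(ch, chr(ord("a") + len(mapping)))
    return tuple("".join(mapping[ch] for ch in w) for w in words), mapping

def value(word, key):
    return int("".join(str(key[ch]) for ch in word))

def verify(words, key):
    """Cheap forward check: the last word must equal the sum of the others."""
    *addends, total = words
    if any(key[w[0]] == 0 for w in words):
        return False
    return sum(value(w, key) for w in addends) == value(total, key)

def offline_search(words):
    """Expensive step, run once per signature while building the catalog."""
    letters = sorted(set("".join(words)))
    for digits in permutations(range(10), len(letters)):
        key = dict(zip(letters, digits))
        if verify(words, key):
            return key
    return None

def build_catalog(puzzles):
    catalog = {}
    for words in puzzles:
        signature, _ = canonical(words)
        if signature not in catalog:
            catalog[signature] = offline_search(list(signature))
    return catalog

def recall_and_verify(words, catalog):
    """Answer-time path: look up the signature, map the key back, verify."""
    signature, mapping = canonical(words)
    canon_key = catalog.get(signature)
    if canon_key is None:
        return None  # not in the catalog: nothing to recall
    key = {ch: canon_key[canon] for ch, canon in mapping.items()}
    return key if verify(words, key) else None

catalog = build_catalog([["TO", "GO", "OUT"]])
print(recall_and_verify(["AB", "CB", "BDA"], catalog))  # same signature, different letters
```

Nothing in `recall_and_verify` searches; whether an instance is answerable depends entirely on whether its signature was precomputed.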

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Chain-of-thought distillation may be limited to tasks whose solutions have inherent left-to-right structure without hidden combinatorial search.
  • The result suggests that hybrid systems combining model recall with external search modules will be required for this class of problems.
  • The same separation between search and forward derivation could appear in other combinatorial reasoning tasks whose generators are deterministic but whose solutions depend on exhaustive elimination.

Load-bearing premise

The eleven chain-of-thought designs, RL from verifiable rewards, self-training, and the key-revealing intervention together are a fair test of whether search can be learned as forward derivation; the conclusion assumes that no untested regime would succeed where these failed.

What would settle it

A model achieving high accuracy on held-out cryptarithm instances via forward chain-of-thought distillation alone, without key revelation or precomputed catalog, would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.21884 by Harsh Patel.

Figure 1
Figure 1: The learnability frontier. Solver (backtracking search) vs. the forward-derivable ceiling (what a left-to-right CoT could honestly imitate) vs. the fine-tuned model. For lookup/fit tasks all three coincide. For cryptarithm the solver reaches 0.71, but the forward-derivable ceiling collapses to ≈ 0.10 and the model tracks that, not the solver (0.05). bit_manipulation is the exception where forward derivatio…
Figure 2
Figure 2: Witnessed vs. teleported verdict. The teleport line is verbatim from a trained model’s transcript (10b71e8a): it computes “6*4 ends 4”, which matches the target ones digit 4, on the very same line, then concludes “none matches → drop,” wrongly eliminating the correct operation. The arithmetic is correct; the verdict does not follow from it. SFT on traces whose verdicts are not witnessed installs exactly th…
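
A witnessed verdict is one that can be recomputed from the numbers on its own line. The sketch below audits lines for that property, in Python. The line schema (fields `a`, `op`, `b`, `claimed_ones`, `target_ones`, `verdict`) is a hypothetical encoding chosen for illustration, not the paper's trace format; the first example line encodes the "6*4 ends 4" case from the Figure 2 caption, and the second is an invented contrast case.

```python
def ones_digit(a, op, b):
    """Ones digit of a (op) b for the two operations used in this sketch."""
    if op == "*":
        return (a * b) % 10
    if op == "+":
        return (a + b) % 10
    raise ValueError(f"unsupported operation: {op}")

def witnessed_verdict(claimed_ones, target_ones):
    """The verdict that follows from the line's own numbers."""
    return "keep" if claimed_ones == target_ones else "drop"

def audit(lines):
    """Return (arithmetic accuracy, fraction of verdicts that follow from the line)."""
    arithmetic_ok = [
        ones_digit(ln["a"], ln["op"], ln["b"]) == ln["claimed_ones"] for ln in lines
    ]
    verdict_ok = [
        witnessed_verdict(ln["claimed_ones"], ln["target_ones"]) == ln["verdict"]
        for ln in lines
    ]
    return sum(arithmetic_ok) / len(lines), sum(verdict_ok) / len(lines)

# "6*4 ends 4" matches target 4, yet the line says drop: correct arithmetic, teleported verdict.
teleported = {"a": 6, "op": "*", "b": 4, "claimed_ones": 4, "target_ones": 4, "verdict": "drop"}
# Invented contrast case: 3*5 ends 5, target is 4, so "drop" is witnessed.
witnessed = {"a": 3, "op": "*", "b": 5, "claimed_ones": 5, "target_ones": 4, "verdict": "drop"}

print(audit([teleported, witnessed]))  # (1.0, 0.5)
```

Scoring arithmetic and verdict separately is what exposes the gap: both lines compute correctly, but only one verdict is entailed by its computation.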
Figure 3
Figure 3: Bit-manipulation: only the model’s own search transfers. Hand-written CoT that teleports the rule search scores like the base model (0.053). STaR on verifier-passed self-traces lifts accuracy and collapses budget truncation, because the imitated traces are searches the model can actually run within budget. Baseline→STaR is significant (disjoint 95% CIs, n=500). The r1→r2 step is within noise.
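
The "disjoint 95% CIs" criterion in the Figure 3 caption can be checked with a few lines of Python. This sketch uses the Wilson score interval, which is an assumption: the caption does not say which interval the paper computes. The baseline 0.053 and n=500 come from the caption; 0.68 is the abstract's figure for the 8-bit boolean task, used here only as a stand-in for the post-STaR accuracy, which the caption does not state.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def disjoint(a, b):
    """True when two (low, high) intervals do not overlap."""
    return a[1] < b[0] or b[1] < a[0]

baseline = wilson_interval(0.053, 500)  # from the caption
after = wilson_interval(0.68, 500)      # stand-in value, see the note above
print(baseline, after, disjoint(baseline, after))
```

Disjoint intervals are a conservative test of a difference; two proportions can differ significantly even when their intervals overlap slightly, which is relevant when reading the "within noise" r1→r2 step.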
Figure 4
Figure 4: One floor, many rounds. Cryptarithm-deduce accuracy under successive CoT designs (r1–r11. The nine reaching a held-out greedy eval are shown; RLVR/GRPO and the generate-and-verify reframe ran between rounds;
Figure 5
Figure 5: Verdict-as-token. Across the elimination-line types that drive cryptarithm, the arithmetic on each line is essentially perfect while the line’s verdict, “therefore drop this candidate”, follows from its own numbers only 16–51% of the time. The model imitates the shape of a verifiable step without its content. Line counts n = 390/102/79/35. The arithmetic-vs-verdict gap is significant for every type (95% CI…
Figure 6
Figure 6: Leaderboard trajectory. v15→v16 is a regression from a warm-start that failed to enter the new grammars at greedy decoding; v17 (0.85) is the first reproducible from-base bank. v18 adds bit-manipulation STaR but does not move the total; the headroom that remains is cryptarithm.
Original abstract

It is tempting to assume any task solvable by a short program can be taught to a model as its chain-of-thought: write the steps out, fine-tune, and the model follows. This paper shows the assumption fails for an identifiable class of procedures. The testbed is nine reasoning tasks, each from a deterministic generator; public and hidden splits share generators, so held-out data proxies test accuracy. I reverse-engineer the generators into Python solvers, render them as chain-of-thought, and distill into a rank-<= 32 LoRA over a 30B (3.5B-active) Nemotron model. Forward-computable tasks install readily: lookup/arithmetic and an 8-bit boolean task transfer (>= 0.99 and 0.68). Cryptarithm does not: distilling its backtracking search holds at 0.01-0.07 across eleven chain-of-thought designs, RL from verifiable rewards, and self-training, even though a search solver answers 71% of instances. This is not a capability gap. The model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%; it cannot carry the search forward as a left-to-right derivation. Fine-tuning learns the shape of a verifiable elimination step while its verdicts become unconditional templates, correct only 16-57% of the time ("verdict-as-token"). The ceiling holds across backbones from 3B to 671B and across fine-tuning and prompting; a controlled intervention isolates the cause: revealing the cipher key, which turns the derivation forward, lifts the same instances from 0.03 to 0.57. When a procedure's only solution is search over information-free structure, no faithful forward chain-of-thought exists to imitate. The task becomes learnable only by removing the search, precomputing its combinatorial core into a catalog and reducing the trace to recall plus verification; the 1st-place solution reaches Private LB 0.92 this way. What distills is memorization and verification, not search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that tasks whose only solution is search over information-free structure (e.g., cryptarithm) admit no faithful forward chain-of-thought that can be imitated via distillation, RL from verifiable rewards, or self-training. This is shown by failure (0.01-0.07 accuracy) across eleven CoT designs on a 30B Nemotron model despite a solver reaching 71%, contrasted with high success on forward-computable tasks; the model performs local arithmetic (97-100%) and ranking but cannot carry search forward. A key-revealing intervention lifts performance from 0.03 to 0.57 on the same instances, and success is achieved only by precomputing the combinatorial core into a catalog (reducing to recall+verification, reaching 0.92 LB). The result holds across backbones 3B-671B.

Significance. If the central empirical distinction holds, the work provides concrete evidence that not every short-program-solvable task can be taught as left-to-right CoT imitation, separating search from forward derivation. Strengths include the controlled intervention isolating the forward vs. search distinction, consistent failure across multiple training regimes and model scales, and the explicit contrast with the catalog-based solution that succeeds.

major comments (2)
  1. [Abstract] The claim that 'no faithful forward chain-of-thought exists to imitate' is stronger than the reported evidence, which demonstrates failure only for the eleven tested CoT designs, RL from verifiable rewards, and self-training; the manuscript does not test or rule out other regimes (e.g., dense process supervision or auxiliary planning losses) that might induce an approximate forward surrogate.
  2. [Abstract] The statement that 'the model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%' is presented as evidence that the failure is isolated to search, but without a table or section detailing the exact measurement protocol, aggregation across the eleven designs, or per-instance breakdown, it is difficult to assess whether local steps are truly solved or merely templated.
minor comments (1)
  1. The abstract refers to 'nine reasoning tasks' and 'eleven chain-of-thought designs' without enumerating them or providing a table; adding an explicit list or appendix table would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our claims and improve the presentation of our measurements. We address each major comment below and will make corresponding revisions to the manuscript.

  1. Referee: [Abstract] The claim that 'no faithful forward chain-of-thought exists to imitate' is stronger than the reported evidence, which demonstrates failure only for the eleven tested CoT designs, RL from verifiable rewards, and self-training; the manuscript does not test or rule out other regimes (e.g., dense process supervision or auxiliary planning losses) that might induce an approximate forward surrogate.

    Authors: We agree that the absolute phrasing exceeds the tested regimes. Our experiments cover eleven distinct CoT formats, RL from verifiable rewards, and self-training across model scales, all of which fail to induce faithful forward search. The key-revealing intervention and contrast with catalog-based solutions further isolate the forward-vs-search distinction. Nevertheless, we cannot rule out every conceivable auxiliary loss. We will revise the abstract to read 'no faithful forward chain-of-thought was found to imitate under the tested regimes' and add a limitations paragraph discussing dense process supervision and planning losses as open directions. revision: partial

  2. Referee: [Abstract] The statement that 'the model does the arithmetic on 97-100% of lines and ranks the correct cipher in its top eight on 71%' is presented as evidence that the failure is isolated to search, but without a table or section detailing the exact measurement protocol, aggregation across the eleven designs, or per-instance breakdown, it is difficult to assess whether local steps are truly solved or merely templated.

    Authors: The referee is correct that the measurement protocol requires explicit documentation. We will insert a new subsection (Methods 3.4) that specifies: (i) line-by-line arithmetic verification via exact string matching against the solver trace, (ii) ranking measured by the position of the ground-truth next cipher in the model's top-8 logits at each elimination step, (iii) aggregation as macro-average over all generated lines across the eleven designs, and (iv) per-instance and per-design breakdowns placed in Appendix C. Manual audit of 200 random lines confirmed non-templated arithmetic (97-100% accuracy) independent of search success. revision: yes
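
Items (ii) and (iii) of the protocol described in the simulated response can be sketched in Python as below. The data layout (a dict of candidate scores per step, a dict of per-design hit lists) is hypothetical, and reading "macro-average" as the unweighted mean of per-design rates is an assumption; neither is confirmed by the paper.

```python
def top_k_hit(scores, truth, k=8):
    """True when the ground-truth candidate is among the k highest-scoring candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return truth in ranked[:k]

def macro_average(hits_by_design):
    """Unweighted mean of per-design hit rates, so large designs do not dominate."""
    rates = [sum(hits) / len(hits) for hits in hits_by_design.values() if hits]
    return sum(rates) / len(rates)

# Hypothetical scores for 20 candidates at one elimination step.
scores = {f"c{i}": float(i) for i in range(20)}
print(top_k_hit(scores, "c12"), top_k_hit(scores, "c11"))

# Hypothetical per-step hits for two designs of different size.
print(macro_average({"r1": [True, False], "r2": [True, True, True, True]}))
```

A micro-average over all lines would give 5/6 for the same toy data, so the choice of aggregation is worth stating explicitly in the promised Methods subsection.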

Circularity Check

0 steps flagged

Empirical study with direct experimental comparisons; no circular derivation

The paper reports experimental results from distilling chain-of-thought across eleven designs, RL from verifiable rewards, self-training, multiple model scales, and a controlled key-revealing intervention on nine tasks generated from deterministic solvers. Central claims rest on observed accuracy gaps (e.g., 0.01-0.07 vs. 71% solver) and the intervention lift (0.03 to 0.57), which are measured against external solvers and held-out splits rather than derived from fitted parameters or self-referential equations. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing premises; the work is self-contained against the reported benchmarks and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim relies on the empirical outcomes of the distillation experiments and the key-revealing intervention, with the main assumption being the validity of the task generators as proxies for search procedures.

axioms (1)
  • Domain assumption: Public and hidden splits from the same deterministic generators can proxy for held-out test accuracy on the procedure.
    This allows testing if the model learns the general procedure rather than specific instances.

pith-pipeline@v0.9.1-grok · 5917 in / 1341 out tokens · 47455 ms · 2026-06-26T12:33:28.331775+00:00 · methodology

