Can Transformers Learn to Verify During Backtracking Search?

Katsumi Inoue; Tony Ribeiro; Tuan Nguyen; Yin Jun Phua

arxiv: 2605.22221 · v1 · pith:GJOPH4CZnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI· cs.LO

Can Transformers Learn to Verify During Backtracking Search?

Yin Jun Phua , Tony Ribeiro , Tuan Nguyen , Katsumi Inoue This is my paper

Pith reviewed 2026-05-22 07:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.LO

keywords backtracking searchtransformersselective state attentionconstraint satisfactionhistory dependencereasoningsearch verification

0 comments

The pith

A fixed attention mask lets transformers base backtracking decisions only on the current state rather than on search history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when transformers are trained to predict continue-or-backtrack on full cumulative traces of solver steps, they do not learn to ignore irrelevant prior history. This happens because state features get scattered across the sequence and because the model can latch onto trajectory-specific patterns. The authors introduce localization to keep state features local and Selective State Attention, which is a fixed mask that blocks attention to non-state tokens. With this, the model produces the same decision for two different paths that arrive at identical states. This matters for any system where a language model generates its own search tree, since the right next step should not depend on how it got there.

Core claim

Decoder-only transformers trained autoregressively on cumulative solver traces fail to make decisions that depend only on the current search state, due to scattered state features and history entanglement. Selective State Attention addresses this by applying a fixed attention mask that enforces attention only within the current state block, allowing the model to emit identical continue-or-backtrack predictions for same-state pairs reached via different histories.

What carries the argument

Selective State Attention (SSA), a fixed attention mask over the input trace that restricts the model's attention to tokens representing the current search state.

If this is right

The model will give consistent verification answers after propagation detects a contradiction, independent of path.
Training requires no changes to data, loss, or model parameters.
The approach applies across constraint satisfaction problems like 3-SAT and planning domains like Blocks World.
Similar issues may arise in any autoregressive model that searches by generating its own intermediate steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

At inference time, clearing the context to only the current state could achieve similar isolation without needing the mask during training.
This structural fix might help in theorem provers or planners that use transformers to explore proof or plan trees.
The diagnostic of history entanglement could be applied to other sequential tasks where state determinism is required.

Load-bearing premise

The test domains and the way same-state pairs are constructed are representative enough that matching decisions on those pairs implies the model has learned state-based verification in general.

What would settle it

Run SSA and the baseline on a new backtracking problem, construct pairs of trajectories that reach the identical partial solution state through different sequences, and check if their continue-or-backtrack outputs match exactly.

Figures

Figures reproduced from arXiv: 2605.22221 by Katsumi Inoue, Tony Ribeiro, Tuan Nguyen, Yin Jun Phua.

**Figure 2.** Figure 2: Three trace encodings for tree traversal: [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: All-correct verification accuracy on star-tree traversal as a function [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: SSA attention mask. Blue: full access; green: causal within the prefix; [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Schematic of two constructs introduced in this section. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Solve rates across three domains (mean ± std). Left: primary 3-SAT under three inference protocols. Center: 3-SAT n=75 phase-transition (8 seeds, cumulative inference only) with per-seed points. Right: Backtracking parsing under two protocols. reliably as trajectories grow. On backtracking parsing of an ambiguous expression grammar (state block contains only the input tokens, cursor, and parser stack), S… view at source ↗

**Figure 7.** Figure 7: BFS and DFS training curves on tree traversal under the natural en [PITH_FULL_IMAGE:figures/full_fig_p037_7.png] view at source ↗

**Figure 8.** Figure 8: Solve rate under synthetic corruption of oracle dead-end decisions. False [PITH_FULL_IMAGE:figures/full_fig_p039_8.png] view at source ↗

**Figure 9.** Figure 9: Solve rate versus conditional false-prune rate [PITH_FULL_IMAGE:figures/full_fig_p040_9.png] view at source ↗

**Figure 10.** Figure 10: Length-OOD comparison (3 seeds, held-out validation, shaded regions [PITH_FULL_IMAGE:figures/full_fig_p044_10.png] view at source ↗

**Figure 11.** Figure 11: Padding-control counterfactual (3 seeds, 600 trials per model and [PITH_FULL_IMAGE:figures/full_fig_p045_11.png] view at source ↗

**Figure 12.** Figure 12: Per-seed solve rates for the train-from-scratch ablation. Error bars: [PITH_FULL_IMAGE:figures/full_fig_p048_12.png] view at source ↗

read the original abstract

Backtracking search underlies classical constraint solvers, planners, and theorem provers. Recent transformer-based reasoning systems explore search trees over their own intermediate steps. A common training recipe fits an autoregressive next-token loss on offline solver traces. The model's input at each step is a cumulative trace of all prior decisions. The optimal continue-or-backtrack predictor depends only on the current search state, since two trajectories reaching the same state admit the same viable continuations. We show that decoder-only transformers trained on cumulative traces fail this requirement in two ways: the trace can scatter state features across many positions (scattered retrieval), and the predictor can condition on the trajectory rather than the state (history entanglement). We address scattered retrieval with localization, a trace-level fix that rewrites each decision block to expose state features locally. We address history entanglement with Selective State Attention (SSA), a fixed attention mask that enforces state-based decisions structurally without modifying training data, objective, or parameters. We focus on reactive verification, after propagation has exposed a contradiction. We test SSA on 3-SAT, graph coloring, Blocks World, and backtracking parsing. On same-state pairs that differ only in prior history, SSA emits identical decisions while a cumulative-trained causal baseline does not. Our contribution is a diagnostic of transformer behavior on serialized trajectory data, paired with a structural fix. Pretrained language models that search over their own reasoning steps may face the same failure. Our analysis opens up inference-time context clearing as a candidate way to apply the same isolation without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Transformers on cumulative search traces mix history into decisions, and a fixed attention mask can enforce state-only behavior, though the tests are still narrow.

read the letter

The paper shows that decoder-only transformers trained on full solver traces fail to base continue-or-backtrack choices purely on the current state. Two issues appear: state features get scattered across the sequence, and the model can latch onto trajectory history instead of the state itself. They fix the first with a localization rewrite that keeps state information local in each decision block. They fix the second with Selective State Attention, a static mask that restricts attention to the current block without touching the loss or data.

Referee Report

2 major / 2 minor

Summary. The paper claims that decoder-only transformers trained autoregressively on cumulative solver traces for backtracking search fail to produce state-dependent decisions due to scattered state features across the trace and history entanglement. It introduces localization (a trace rewrite exposing state features locally) and Selective State Attention (SSA), a fixed attention mask enforcing attention only to the current state block. Focused on reactive verification after propagation, experiments on 3-SAT, graph coloring, Blocks World, and backtracking parsing show SSA yields identical decisions on same-state pairs differing only in prior history, unlike a cumulative causal baseline. The contribution is positioned as a diagnostic plus structural fix, with inference-time context clearing suggested as an alternative.

Significance. If the empirical result holds, the work is significant for identifying a concrete failure mode in transformer-based search over serialized trajectories and offering a parameter-free structural intervention (SSA) that does not alter data, loss, or model size. The same-state pair diagnostic is a clear, falsifiable test of state-based verification. Credit is due for the reproducible experimental setup across four domains and for highlighting relevance to pretrained LLMs performing internal search.

major comments (2)

[§4] §4 (Experiments, same-state pair results): The central claim that SSA produces identical decisions while the baseline does not rests on the construction and evaluation of same-state pairs. The manuscript provides no equations, pseudocode, or precise definition of the identity metric (exact token match vs. distribution distance) nor reports error bars or run-to-run variance. This detail is load-bearing because, without it, the observed identity could arise from pair-generation artifacts rather than the mask, as the stress-test concern highlights.
[§3.2] §3.2 (Selective State Attention): The claim that the fixed mask enforces state-based decisions 'structurally without side effects on learning dynamics' is not supported by any ablation on gradient flow, expressivity for non-contradiction steps, or comparison to learned masks. Because the mask is presented as the key load-bearing fix for history entanglement, absence of this analysis leaves the generality of the result open.

minor comments (2)

[§1] The abstract and §1 mention 'inference-time context clearing' as a candidate application but provide no concrete implementation sketch or preliminary result.
[§3.2] Notation for the attention mask in SSA would benefit from an explicit diagram or matrix example showing which positions remain visible after localization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive major comments. We address each point below and commit to revisions that strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: [§4] §4 (Experiments, same-state pair results): The central claim that SSA produces identical decisions while the baseline does not rests on the construction and evaluation of same-state pairs. The manuscript provides no equations, pseudocode, or precise definition of the identity metric (exact token match vs. distribution distance) nor reports error bars or run-to-run variance. This detail is load-bearing because, without it, the observed identity could arise from pair-generation artifacts rather than the mask, as the stress-test concern highlights.

Authors: We agree that the precise definition and supporting statistics are necessary to substantiate the central claim and to address potential artifacts. In the revised manuscript we will add equations and pseudocode that formally define same-state pair construction (identical current search state reached via distinct histories) and the identity metric (exact token-level match on the verification decision). We will also report mean and standard deviation of the identity rate across five independent training runs with different random seeds, confirming that the effect is robust rather than an artifact of any single pair-generation procedure. revision: yes
Referee: [§3.2] §3.2 (Selective State Attention): The claim that the fixed mask enforces state-based decisions 'structurally without side effects on learning dynamics' is not supported by any ablation on gradient flow, expressivity for non-contradiction steps, or comparison to learned masks. Because the mask is presented as the key load-bearing fix for history entanglement, absence of this analysis leaves the generality of the result open.

Authors: The fixed mask in SSA is applied identically during training and inference and simply zeros attention logits outside the current state block; the loss, optimizer, and parameter updates therefore remain unchanged for all attended positions. We will add a short clarifying paragraph in §3.2 that explains this construction and notes that expressivity for non-contradiction steps is preserved because the mask never removes state features required for the verification decision. A direct comparison to learned masks would require a different training regime and is therefore outside the scope of the present parameter-free intervention; we view such an ablation as valuable future work rather than a prerequisite for the current result. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical identity result on same-state pairs is not forced by construction

full rationale

The paper's central claim is an empirical observation that SSA produces identical decisions on constructed same-state pairs (differing only in prior history) while a cumulative baseline does not. This rests on experimental comparison across domains rather than any derivation, equation, or fitted parameter that reduces to its own inputs by definition. No self-definitional steps, predictions called from fits, load-bearing self-citations, or ansatzes smuggled via prior work appear in the described chain. The SSA mask is presented as a structural intervention whose effect is measured externally on test pairs, making the analysis self-contained against the benchmark of decision identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on standard transformer architecture assumptions and the empirical construction of same-state test pairs; no free parameters are introduced and the new entity is the SSA mask itself.

axioms (1)

domain assumption Decoder-only transformers trained autoregressively on cumulative traces can be evaluated on state-equivalence pairs
Invoked when constructing the diagnostic test for identical decisions on same-state pairs.

invented entities (1)

Selective State Attention (SSA) no independent evidence
purpose: Fixed attention mask that enforces state-based decisions without modifying training data or parameters
New structural component introduced to address history entanglement.

pith-pipeline@v0.9.0 · 5812 in / 1355 out tokens · 30440 ms · 2026-05-22T07:15:16.647774+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We address history entanglement with Selective State Attention (SSA), a fixed attention mask that enforces state-based decisions structurally... On same-state pairs that differ only in prior history, SSA emits identical decisions while a cumulative-trained causal baseline does not.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Proposition 1 (SSA Conditional Invariance)... the logits at positions inside B_t depend only on (P, B_t).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages

[1]

Proceedings of the 5th International Joint Conference on Learning & Reasoning (IJCLR) , year=

Transformers Can Admit Mistakes and Backtrack , author=. Proceedings of the 5th International Joint Conference on Learning & Reasoning (IJCLR) , year=

work page
[2]

Communications of the ACM , volume =

Kowalski, Robert , title =. Communications of the ACM , volume =. 1979 , doi =

work page 1979
[3]

Veli. The. Proceedings of the 39th International Conference on Machine Learning , volume=. 2022 , organization=

work page 2022
[4]

Deep Learning for Code Workshop at ICLR 2022 , year=

Show Your Work: Scratchpads for Intermediate Computation with Language Models , author=. Deep Learning for Code Workshop at ICLR 2022 , year=

work page 2022
[5]

2021 , doi =

Training verifiers to solve math word problems , author=. 2021 , doi =

work page 2021
[6]

The Eleventh International Conference on Learning Representations , year =

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. The Eleventh International Conference on Learning Representations , year =

work page
[7]

Snell, Charlie and Lee, Jaehoon and Xu, Kelvin and Kumar, Aviral , booktitle=. Scaling. 2025 , url=

work page 2025
[8]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=

work page
[9]

Learning a

Selsam, Daniel and Lamm, Matthew and B. Learning a. International Conference on Learning Representations (ICLR) , year =

work page
[10]

The Twelfth International Conference on Learning Representations , year =

Large Language Models Cannot Self-Correct Reasoning Yet , author=. The Twelfth International Conference on Learning Representations , year =

work page
[11]

The Twelfth International Conference on Learning Representations , year =

Vision Transformers Need Registers , author=. The Twelfth International Conference on Learning Representations , year =

work page
[12]

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , volume =

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , title =. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , volume =

work page
[13]

Neurocomputing , volume =

Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Wen, Bo and Liu, Yunfeng , title =. Neurocomputing , volume =. 2024 , doi =

work page 2024
[14]

Advances in Neural Information Processing Systems , volume=

Object-centric learning with slot attention , author=. Advances in Neural Information Processing Systems , volume=

work page
[15]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Step Back to Leap Forward: Self-Backtracking for Symbolic Reasoning and Planning in Language Models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2026 , doi=

work page 2026
[16]

2025 , doi =

Kim, Joongwon and Goyal, Anirudh and Tan, Liang and Hajishirzi, Hannaneh and Iyer, Srinivasan and Wang, Tianlu , note =. 2025 , doi =

work page 2025
[17]

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , series=

A reduction of imitation learning and structured prediction to no-regret online learning , author=. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , series=

work page
[18]

Can Transformers Reason Logically?

Pan, Leyan and Ganesh, Vijay and Abernethy, Jacob and Esposo, Chris and Lee, Wenke , booktitle=. Can Transformers Reason Logically?. 2025 , organization=

work page 2025
[19]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Teaching Transformers to Solve Combinatorial Problems through Efficient Trial & Error , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page
[20]

2020 , doi =

Ainslie, Joshua and Ontanon, Santiago and Alberti, Chris and Cvicek, Vaclav and Fisher, Zachary and Pham, Philip and Ravula, Anirudh and Sanghai, Sumit and Wang, Qifan and Yang, Li , booktitle=. 2020 , doi =

work page 2020
[21]

Advances in Neural Information Processing Systems , volume=

Block transformer: Global-to-local language modeling for fast inference , author=. Advances in Neural Information Processing Systems , volume=. 2024 , doi =

work page 2024
[22]

The Thirteenth International Conference on Learning Representations , year =

Selective Attention Improves Transformer , author=. The Thirteenth International Conference on Learning Representations , year =

work page
[23]

Advances in Neural Information Processing Systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023
[24]

Proceedings of the 41st International Conference on Machine Learning , series=

Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models , author=. Proceedings of the 41st International Conference on Machine Learning , series=

work page
[25]

Nature , volume=

Solving olympiad geometry without human demonstrations , author=. Nature , volume=. 2024 , publisher=

work page 2024
[26]

Nature , volume=

Mathematical discoveries from program search with large language models , author=. Nature , volume=. 2024 , publisher=

work page 2024
[27]

2023 , doi =

System 2 Attention (is something you might need too) , author=. 2023 , doi =

work page 2023
[28]

Proceedings of the 40th International Conference on Machine Learning , series=

Scaling laws for reward model overoptimization , author=. Proceedings of the 40th International Conference on Machine Learning , series=

work page
[29]

Advances in Neural Information Processing Systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023
[30]

The Twelfth International Conference on Learning Representations , year=

NeuroBack: Improving CDCL SAT Solving using Graph Neural Networks , author=. The Twelfth International Conference on Learning Representations , year=

work page
[31]

Nature , volume=

Olympiad-level formal mathematical reasoning with reinforcement learning , author=. Nature , volume=. 2026 , publisher=

work page 2026
[32]

Second Conference on Language Modeling , year=

To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning , author=. Second Conference on Language Modeling , year=

work page
[33]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

The lessons of developing process reward models in mathematical reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=. 2025 , doi =

work page 2025
[34]

and Littman, Michael L

Li, Lihong and Walsh, Thomas J. and Littman, Michael L. , title=. International Symposium on Artificial Intelligence and Mathematics (

work page
[35]

and Thomas, Joy A

Cover, Thomas M. and Thomas, Joy A. , title=. 2006 , doi =

work page 2006
[36]

, title=

Tsybakov, Alexandre B. , title=. 2009 , doi =

work page 2009
[37]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Towards clause-learning state space search: Learning to recognize dead-ends , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2016 , doi =

work page 2016
[38]

Advances in Neural Information Processing Systems , volume =

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes , author =. Advances in Neural Information Processing Systems , volume =. 2022 , doi =

work page 2022
[39]

2022 , doi =

Teaching Algorithmic Reasoning via In-context Learning , author =. 2022 , doi =

work page 2022
[40]

Transactions on Machine Learning Research , issn =

Linear Algebra with Transformers , author =. Transactions on Machine Learning Research , issn =. 2022 , url =

work page 2022
[41]

Advances in Neural Information Processing Systems , volume =

Exploring Length Generalization in Large Language Models , author =. Advances in Neural Information Processing Systems , volume =. 2022 , doi =

work page 2022
[42]

The Twelfth International Conference on Learning Representations , year =

Teaching Arithmetic to Small Transformers , author =. The Twelfth International Conference on Learning Representations , year =

work page
[43]

2023 , doi =

Length Generalization in Arithmetic Transformers , author =. 2023 , doi =

work page 2023
[44]

Proceedings of the 40th International Conference on Machine Learning , series =

Looped Transformers as Programmable Computers , author =. Proceedings of the 40th International Conference on Machine Learning , series =

work page
[45]

The Twelfth International Conference on Learning Representations , year =

Looped Transformers are Better at Learning Learning Algorithms , author =. The Twelfth International Conference on Learning Representations , year =

work page
[46]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

Predicting Propositional Satisfiability via End-to-End Learning , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =. 2020 , doi =

work page 2020
[47]

2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =

SATformer: Transformer-Based UNSAT Core Learning , author =. 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =. 2023 , doi =

work page 2023
[48]

2021 , doi =

Transformer-based Machine Learning for Fast SAT Solvers and Logic Synthesis , author =. 2021 , doi =

work page 2021
[49]

and Lewis, Mike , title =

Press, Ofir and Smith, Noah A. and Lewis, Mike , title =. International Conference on Learning Representations (ICLR) , year =

work page
[50]

Advances in Neural Information Processing Systems (NeurIPS) , volume =

Kazemnejad, Amirhossein and Padhi, Inkit and Natesan Ramamurthy, Karthikeyan and Das, Payel and Reddy, Siva , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =. 2023 , doi =

work page 2023
[51]

and Lynce, In\^es and Malik, Sharad , title =

Marques-Silva, Jo\ ao P. and Lynce, In\^es and Malik, Sharad , title =. Handbook of Satisfiability -- Second Edition , editor =. 2021 , doi =

work page 2021
[52]

2006 , doi =

Handbook of Constraint Programming , series =. 2006 , doi =

work page 2006
[53]

Advances in Neural Information Processing Systems , volume=

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=. Advances in Neural Information Processing Systems , volume=. 2022 , doi =

work page 2022
[54]

Advances in Neural Information Processing Systems , volume=

Large Language Models are Zero-Shot Reasoners , author=. Advances in Neural Information Processing Systems , volume=. 2022 , doi =

work page 2022
[55]

Advances in Neural Information Processing Systems , volume=

Self-Refine: Iterative Refinement with Self-Feedback , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023
[56]

2025 , publisher=

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and others , journal=. 2025 , publisher=

work page 2025
[57]

Advances in Neural Information Processing Systems , volume=

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , author=. Advances in Neural Information Processing Systems , volume=

work page
[58]

Advances in Neural Information Processing Systems , volume=

Faith and Fate: Limits of Transformers on Compositionality , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023
[59]

2022 , doi =

Solving Math Word Problems With Process- and Outcome-Based Feedback , author=. 2022 , doi =

work page 2022
[60]

Pan, Liangming and Albalak, Alon and Wang, Xinyi and Wang, William Yang , booktitle=. Logic-. 2023 , doi =

work page 2023
[61]

Advances in Neural Information Processing Systems , volume=

On the Planning Abilities of Large Language Models - A Critical Investigation , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023
[62]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023
[63]

Gou, Zhibin and Shao, Zhihong and Gong, Yeyun and Shen, Yelong and Yang, Yujiu and Duan, Nan and Chen, Weizhu , booktitle=

work page
[64]

Journal of the ACM , volume =

Davis, Martin and Putnam, Hilary , title =. Journal of the ACM , volume =. 1960 , publisher =

work page 1960
[65]

, title =

Davis, Martin and Logemann, George and Loveland, Donald W. , title =. Communications of the ACM , volume =. 1962 , publisher =

work page 1962
[66]

Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =

Marques-Silva, Jo. Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =. 1996 , publisher =

work page 1996
[67]

and Madigan, Conor F

Moskewicz, Matthew W. and Madigan, Conor F. and Zhao, Ying and Zhang, Lintao and Malik, Sharad , title =. Proceedings of the 38th Design Automation Conference (DAC) , pages =. 2001 , publisher =

work page 2001
[68]

Computational Intelligence , volume =

Prosser, Patrick , title =. Computational Intelligence , volume =. 1993 , publisher =

work page 1993
[69]

An Extensible

E. An Extensible. Theory and Applications of Satisfiability Testing (SAT 2003) , series =. 2003 , publisher =

work page 2003
[70]

Gomez and

Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and. Attention is All You Need , booktitle =. 2017 , publisher =

work page 2017
[71]

International Conference on Learning Representations , year =

Ilya Loshchilov and Frank Hutter , title =. International Conference on Learning Representations , year =

work page
[72]

The Annals of Statistics , volume =

Bradley Efron , title =. The Annals of Statistics , volume =. 1979 , publisher =

work page 1979
[73]

Proceedings of the 31st

Bryan Ford , title =. Proceedings of the 31st. 2004 , publisher =

work page 2004
[74]

Mitchell and Bart Selman and Hector J

David G. Mitchell and Bart Selman and Hector J. Levesque , title =. Proceedings of the 10th National Conference on Artificial Intelligence , pages =. 1992 , publisher =

work page 1992
[75]

Crawford and Larry D

James M. Crawford and Larry D. Auton , title =. Artificial Intelligence , volume =. 1996 , publisher =

work page 1996
[76]

Neural Computation , volume=

Long Short-Term Memory , author=. Neural Computation , volume=. 1997 , publisher=

work page 1997
[77]

IEEE Transactions on Neural Networks , volume=

The Graph Neural Network Model , author=. IEEE Transactions on Neural Networks , volume=. 2009 , publisher=

work page 2009

[1] [1]

Proceedings of the 5th International Joint Conference on Learning & Reasoning (IJCLR) , year=

Transformers Can Admit Mistakes and Backtrack , author=. Proceedings of the 5th International Joint Conference on Learning & Reasoning (IJCLR) , year=

work page

[2] [2]

Communications of the ACM , volume =

Kowalski, Robert , title =. Communications of the ACM , volume =. 1979 , doi =

work page 1979

[3] [3]

Veli. The. Proceedings of the 39th International Conference on Machine Learning , volume=. 2022 , organization=

work page 2022

[4] [4]

Deep Learning for Code Workshop at ICLR 2022 , year=

Show Your Work: Scratchpads for Intermediate Computation with Language Models , author=. Deep Learning for Code Workshop at ICLR 2022 , year=

work page 2022

[5] [5]

2021 , doi =

Training verifiers to solve math word problems , author=. 2021 , doi =

work page 2021

[6] [6]

The Eleventh International Conference on Learning Representations , year =

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. The Eleventh International Conference on Learning Representations , year =

work page

[7] [7]

Snell, Charlie and Lee, Jaehoon and Xu, Kelvin and Kumar, Aviral , booktitle=. Scaling. 2025 , url=

work page 2025

[8] [8]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=

work page

[9] [9]

Learning a

Selsam, Daniel and Lamm, Matthew and B. Learning a. International Conference on Learning Representations (ICLR) , year =

work page

[10] [10]

The Twelfth International Conference on Learning Representations , year =

Large Language Models Cannot Self-Correct Reasoning Yet , author=. The Twelfth International Conference on Learning Representations , year =

work page

[11] [11]

The Twelfth International Conference on Learning Representations , year =

Vision Transformers Need Registers , author=. The Twelfth International Conference on Learning Representations , year =

work page

[12] [12]

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , volume =

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , title =. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , volume =

work page

[13] [13]

Neurocomputing , volume =

Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Wen, Bo and Liu, Yunfeng , title =. Neurocomputing , volume =. 2024 , doi =

work page 2024

[14] [14]

Advances in Neural Information Processing Systems , volume=

Object-centric learning with slot attention , author=. Advances in Neural Information Processing Systems , volume=

work page

[15] [15]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Step Back to Leap Forward: Self-Backtracking for Symbolic Reasoning and Planning in Language Models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2026 , doi=

work page 2026

[16] [16]

2025 , doi =

Kim, Joongwon and Goyal, Anirudh and Tan, Liang and Hajishirzi, Hannaneh and Iyer, Srinivasan and Wang, Tianlu , note =. 2025 , doi =

work page 2025

[17] [17]

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , series=

A reduction of imitation learning and structured prediction to no-regret online learning , author=. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , series=

work page

[18] [18]

Can Transformers Reason Logically?

Pan, Leyan and Ganesh, Vijay and Abernethy, Jacob and Esposo, Chris and Lee, Wenke , booktitle=. Can Transformers Reason Logically?. 2025 , organization=

work page 2025

[19] [19]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Teaching Transformers to Solve Combinatorial Problems through Efficient Trial & Error , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

work page

[20] [20]

2020 , doi =

Ainslie, Joshua and Ontanon, Santiago and Alberti, Chris and Cvicek, Vaclav and Fisher, Zachary and Pham, Philip and Ravula, Anirudh and Sanghai, Sumit and Wang, Qifan and Yang, Li , booktitle=. 2020 , doi =

work page 2020

[21] [21]

Advances in Neural Information Processing Systems , volume=

Block transformer: Global-to-local language modeling for fast inference , author=. Advances in Neural Information Processing Systems , volume=. 2024 , doi =

work page 2024

[22] [22]

The Thirteenth International Conference on Learning Representations , year =

Selective Attention Improves Transformer , author=. The Thirteenth International Conference on Learning Representations , year =

work page

[23] [23]

Advances in Neural Information Processing Systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023

[24] [24]

Proceedings of the 41st International Conference on Machine Learning , series=

Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models , author=. Proceedings of the 41st International Conference on Machine Learning , series=

work page

[25] [25]

Nature , volume=

Solving olympiad geometry without human demonstrations , author=. Nature , volume=. 2024 , publisher=

work page 2024

[26] [26]

Nature , volume=

Mathematical discoveries from program search with large language models , author=. Nature , volume=. 2024 , publisher=

work page 2024

[27] [27]

2023 , doi =

System 2 Attention (is something you might need too) , author=. 2023 , doi =

work page 2023

[28] [28]

Proceedings of the 40th International Conference on Machine Learning , series=

Scaling laws for reward model overoptimization , author=. Proceedings of the 40th International Conference on Machine Learning , series=

work page

[29] [29]

Advances in Neural Information Processing Systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023

[30] [30]

The Twelfth International Conference on Learning Representations , year=

NeuroBack: Improving CDCL SAT Solving using Graph Neural Networks , author=. The Twelfth International Conference on Learning Representations , year=

work page

[31] [31]

Nature , volume=

Olympiad-level formal mathematical reasoning with reinforcement learning , author=. Nature , volume=. 2026 , publisher=

work page 2026

[32] [32]

Second Conference on Language Modeling , year=

To Backtrack or Not to Backtrack: When Sequential Search Limits Model Reasoning , author=. Second Conference on Language Modeling , year=

work page

[33] [33]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

The lessons of developing process reward models in mathematical reasoning , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=. 2025 , doi =

work page 2025

[34] [34]

and Littman, Michael L

Li, Lihong and Walsh, Thomas J. and Littman, Michael L. , title=. International Symposium on Artificial Intelligence and Mathematics (

work page

[35] [35]

and Thomas, Joy A

Cover, Thomas M. and Thomas, Joy A. , title=. 2006 , doi =

work page 2006

[36] [36]

, title=

Tsybakov, Alexandre B. , title=. 2009 , doi =

work page 2009

[37] [37]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Towards clause-learning state space search: Learning to recognize dead-ends , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=. 2016 , doi =

work page 2016

[38] [38]

Advances in Neural Information Processing Systems , volume =

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes , author =. Advances in Neural Information Processing Systems , volume =. 2022 , doi =

work page 2022

[39] [39]

2022 , doi =

Teaching Algorithmic Reasoning via In-context Learning , author =. 2022 , doi =

work page 2022

[40] [40]

Transactions on Machine Learning Research , issn =

Linear Algebra with Transformers , author =. Transactions on Machine Learning Research , issn =. 2022 , url =

work page 2022

[41] [41]

Advances in Neural Information Processing Systems , volume =

Exploring Length Generalization in Large Language Models , author =. Advances in Neural Information Processing Systems , volume =. 2022 , doi =

work page 2022

[42] [42]

The Twelfth International Conference on Learning Representations , year =

Teaching Arithmetic to Small Transformers , author =. The Twelfth International Conference on Learning Representations , year =

work page

[43] [43]

2023 , doi =

Length Generalization in Arithmetic Transformers , author =. 2023 , doi =

work page 2023

[44] [44]

Proceedings of the 40th International Conference on Machine Learning , series =

Looped Transformers as Programmable Computers , author =. Proceedings of the 40th International Conference on Machine Learning , series =

work page

[45] [45]

The Twelfth International Conference on Learning Representations , year =

Looped Transformers are Better at Learning Learning Algorithms , author =. The Twelfth International Conference on Learning Representations , year =

work page

[46] [46]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

Predicting Propositional Satisfiability via End-to-End Learning , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =. 2020 , doi =

work page 2020

[47] [47]

2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =

SATformer: Transformer-Based UNSAT Core Learning , author =. 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =. 2023 , doi =

work page 2023

[48] [48]

2021 , doi =

Transformer-based Machine Learning for Fast SAT Solvers and Logic Synthesis , author =. 2021 , doi =

work page 2021

[49] [49]

and Lewis, Mike , title =

Press, Ofir and Smith, Noah A. and Lewis, Mike , title =. International Conference on Learning Representations (ICLR) , year =

work page

[50] [50]

Advances in Neural Information Processing Systems (NeurIPS) , volume =

Kazemnejad, Amirhossein and Padhi, Inkit and Natesan Ramamurthy, Karthikeyan and Das, Payel and Reddy, Siva , title =. Advances in Neural Information Processing Systems (NeurIPS) , volume =. 2023 , doi =

work page 2023

[51] [51]

and Lynce, In\^es and Malik, Sharad , title =

Marques-Silva, Jo\ ao P. and Lynce, In\^es and Malik, Sharad , title =. Handbook of Satisfiability -- Second Edition , editor =. 2021 , doi =

work page 2021

[52] [52]

2006 , doi =

Handbook of Constraint Programming , series =. 2006 , doi =

work page 2006

[53] [53]

Advances in Neural Information Processing Systems , volume=

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author=. Advances in Neural Information Processing Systems , volume=. 2022 , doi =

work page 2022

[54] [54]

Advances in Neural Information Processing Systems , volume=

Large Language Models are Zero-Shot Reasoners , author=. Advances in Neural Information Processing Systems , volume=. 2022 , doi =

work page 2022

[55] [55]

Advances in Neural Information Processing Systems , volume=

Self-Refine: Iterative Refinement with Self-Feedback , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023

[56] [56]

2025 , publisher=

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and others , journal=. 2025 , publisher=

work page 2025

[57] [57]

Advances in Neural Information Processing Systems , volume=

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , author=. Advances in Neural Information Processing Systems , volume=

work page

[58] [58]

Advances in Neural Information Processing Systems , volume=

Faith and Fate: Limits of Transformers on Compositionality , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023

[59] [59]

2022 , doi =

Solving Math Word Problems With Process- and Outcome-Based Feedback , author=. 2022 , doi =

work page 2022

[60] [60]

Pan, Liangming and Albalak, Alon and Wang, Xinyi and Wang, William Yang , booktitle=. Logic-. 2023 , doi =

work page 2023

[61] [61]

Advances in Neural Information Processing Systems , volume=

On the Planning Abilities of Large Language Models - A Critical Investigation , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023

[62] [62]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems , volume=. 2023 , doi =

work page 2023

[63] [63]

Gou, Zhibin and Shao, Zhihong and Gong, Yeyun and Shen, Yelong and Yang, Yujiu and Duan, Nan and Chen, Weizhu , booktitle=

work page

[64] [64]

Journal of the ACM , volume =

Davis, Martin and Putnam, Hilary , title =. Journal of the ACM , volume =. 1960 , publisher =

work page 1960

[65] [65]

, title =

Davis, Martin and Logemann, George and Loveland, Donald W. , title =. Communications of the ACM , volume =. 1962 , publisher =

work page 1962

[66] [66]

Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =

Marques-Silva, Jo. Proceedings of the 1996 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) , pages =. 1996 , publisher =

work page 1996

[67] [67]

and Madigan, Conor F

Moskewicz, Matthew W. and Madigan, Conor F. and Zhao, Ying and Zhang, Lintao and Malik, Sharad , title =. Proceedings of the 38th Design Automation Conference (DAC) , pages =. 2001 , publisher =

work page 2001

[68] [68]

Computational Intelligence , volume =

Prosser, Patrick , title =. Computational Intelligence , volume =. 1993 , publisher =

work page 1993

[69] [69]

An Extensible

E. An Extensible. Theory and Applications of Satisfiability Testing (SAT 2003) , series =. 2003 , publisher =

work page 2003

[70] [70]

Gomez and

Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and. Attention is All You Need , booktitle =. 2017 , publisher =

work page 2017

[71] [71]

International Conference on Learning Representations , year =

Ilya Loshchilov and Frank Hutter , title =. International Conference on Learning Representations , year =

work page

[72] [72]

The Annals of Statistics , volume =

Bradley Efron , title =. The Annals of Statistics , volume =. 1979 , publisher =

work page 1979

[73] [73]

Proceedings of the 31st

Bryan Ford , title =. Proceedings of the 31st. 2004 , publisher =

work page 2004

[74] [74]

Mitchell and Bart Selman and Hector J

David G. Mitchell and Bart Selman and Hector J. Levesque , title =. Proceedings of the 10th National Conference on Artificial Intelligence , pages =. 1992 , publisher =

work page 1992

[75] [75]

Crawford and Larry D

James M. Crawford and Larry D. Auton , title =. Artificial Intelligence , volume =. 1996 , publisher =

work page 1996

[76] [76]

Neural Computation , volume=

Long Short-Term Memory , author=. Neural Computation , volume=. 1997 , publisher=

work page 1997

[77] [77]

IEEE Transactions on Neural Networks , volume=

The Graph Neural Network Model , author=. IEEE Transactions on Neural Networks , volume=. 2009 , publisher=

work page 2009