Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Guohao Dai; Huazhong Yang; Tianyu Fu; Yichen You; Yu Wang; Zekai Chen

arXiv: 2511.08577 · v2 · submitted 2025-11-11 · 💻 cs.CL · cs.AI · cs.LG · cs.PF

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

Pith reviewed 2026-05-17 23:04 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.PF

keywords: looped transformers · selective latent iteration · latent overthinking · reasoning LLMs · depth-aware LoRA · duo-causal attention · hard token refinement · Think-at-Hard

The pith

A neural decider triggers latent iterations only on tokens that are likely incorrect, improving reasoning accuracy while skipping extra iterations on 93 percent of tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Looped transformers perform multiple latent iterations to refine token predictions but often suffer from overthinking that turns initially correct answers into errors. The paper shows an oracle selective policy can boost performance by up to 7.3 percent and introduces Think-at-Hard to approximate this with a lightweight neural decider that activates extra iterations only where the first pass is probably wrong. Depth-aware LoRA modules refocus the model on hard-token refinement during those iterations, while duo-causal attention adds an iteration-depth dimension so information flows across steps without losing sequential parallelism. Experiments across nine benchmarks in math, question answering, and coding confirm gains over both single-pass and always-iterate models, with the same or slightly higher parameter counts.
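
One way to picture what "depth-aware" could mean here is to write the effective weight at iteration depth d as the shared backbone weight plus a low-rank update that is switched on only after the first pass. The symbols W, A_d, B_d, r, m, and n below are illustrative assumptions, not notation taken from the paper.

```latex
W^{(d)} \;=\; W \;+\; \mathbb{1}[d > 1]\, B_d A_d,
\qquad B_d \in \mathbb{R}^{m \times r},\quad A_d \in \mathbb{R}^{r \times n},\quad r \ll \min(m, n)
```

Under this reading, the first pass (d = 1) runs the unmodified backbone, and only the refinement passes pay for the extra low-rank parameters.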

Core claim

Think-at-Hard is a looped transformer that uses a lightweight neural decider to trigger latent iterations selectively at tokens likely to be incorrect after the standard forward pass. During selected iterations, depth-aware LoRA modules shift the objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention across both the token sequence and an additional iteration-depth dimension, enabling cross-iteration information flow while preserving full sequential parallelism. This selective policy produces consistent accuracy improvements on reasoning tasks.
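
The control flow described above can be sketched in a few lines. This is a toy illustration only: the names `selective_iterate`, `continue_prob`, `refine`, and the 0.9 threshold are hypothetical, and plain floats stand in for hidden states. It is not the authors' implementation.

```python
from typing import Callable, List, Tuple


def selective_iterate(
    first_pass: List[float],
    continue_prob: List[float],
    refine: Callable[[float], float],
    threshold: float = 0.9,  # hypothetical decision threshold
) -> Tuple[List[float], List[bool]]:
    """Refine only the tokens whose decider score exceeds the threshold."""
    outputs: List[float] = []
    iterated: List[bool] = []
    for state, p in zip(first_pass, continue_prob):
        think_again = p > threshold
        # Easy tokens keep their first-pass state untouched.
        outputs.append(refine(state) if think_again else state)
        iterated.append(think_again)
    return outputs, iterated


# Toy inputs: four "token states" and the decider's score for each.
states = [0.2, 0.5, 0.9, 0.4]
scores = [0.05, 0.95, 0.10, 0.30]
outs, fired = selective_iterate(states, scores, refine=lambda s: s + 1.0)
print(outs)   # [0.2, 1.5, 0.9, 0.4]
print(fired)  # [False, True, False, False]
```

Only the second token crosses the threshold, so it alone receives the extra pass. Tokens the decider leaves alone cannot be revised into errors, which is the point of the selective policy.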

What carries the argument

Lightweight neural decider that selects tokens for latent iteration, combined with depth-aware LoRA for refinement and duo-causal attention for cross-iteration information flow.
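
A minimal sketch of an attention mask over two axes may help make the duo-causal idea concrete. It assumes, as an interpretation rather than a confirmed detail, that a query at (position, depth) may attend to any key at an earlier-or-equal position and a shallower-or-equal depth. It also assumes every token exists at every depth, which a selective policy would not do. The function name `duo_causal_mask` is hypothetical.

```python
from itertools import product
from typing import List, Tuple


def duo_causal_mask(
    num_tokens: int, max_depth: int
) -> Tuple[List[Tuple[int, int]], List[List[bool]]]:
    """Build a mask where mask[q][k] is True if query q may attend to key k.

    Each entry of `index` is a (position, depth) pair; depth starts at 1.
    """
    index = list(product(range(num_tokens), range(1, max_depth + 1)))
    mask = [
        [(k_pos <= q_pos) and (k_depth <= q_depth) for (k_pos, k_depth) in index]
        for (q_pos, q_depth) in index
    ]
    return index, mask


index, mask = duo_causal_mask(num_tokens=3, max_depth=2)
q = index.index((1, 2))  # second token, second iteration
allowed = [index[k] for k, ok in enumerate(mask[q]) if ok]
print(allowed)  # [(0, 1), (0, 2), (1, 1), (1, 2)]
```

Because the mask is fixed ahead of time, all positions at a given depth can be processed together, which is consistent with the parallelism claim in the text.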

If this is right

With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4 percent while skipping iterations on 93 percent of tokens.
TaH exceeds single-iteration Qwen3 baselines by 3.0-3.8 percent.
Allowing less than 3 percent extra parameters from the LoRA and decider modules raises gains to 5.3-6.2 percent over always-iterate and 6.1-6.8 percent over single-pass baselines.
Improvements hold consistently across nine benchmarks covering math, QA, and coding tasks.
An oracle iteration policy demonstrates up to 7.3 percent headroom if token selection can be made perfect.
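
The 93 percent skip rate can be turned into a rough average-depth figure. The sketch below assumes that every non-skipped token runs the full iteration budget, that the budget is two passes, and that decider overhead is negligible. None of these assumptions is confirmed by the text above, and `mean_iterations` is a hypothetical helper.

```python
def mean_iterations(skip_rate: float, max_iterations: int = 2) -> float:
    """Average passes per token when skipped tokens stop after the first pass."""
    extra_passes = max_iterations - 1
    return 1.0 + (1.0 - skip_rate) * extra_passes


selective = mean_iterations(0.93)      # TaH-style selective policy
always = mean_iterations(0.0)          # always-iterate baseline
print(round(selective, 2))  # 1.07
print(round(always, 2))     # 2.0
```

Under these assumptions the selective policy averages about 1.07 passes per token against 2 for an always-iterate model.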

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The selective skipping pattern could be transferred to other iterative refinement architectures to cut average inference cost without sacrificing peak accuracy.
Duo-causal attention may support deeper iteration stacks in future models while retaining parallelism benefits.
Jointly training the decider with the base model instead of on frozen first-pass outputs might further reduce the gap to the oracle policy.
Similar overthinking effects may appear in non-transformer iterative systems, suggesting the decider idea could generalize beyond language models.

Load-bearing premise

A lightweight neural decider trained on first-pass outputs can reliably identify which tokens would benefit from extra latent iterations without introducing new errors on tokens that were already correct.

What would settle it

The value of the learned selective policy would be falsified if, on the same nine benchmarks, TaH accuracy dropped below the always-iterate baseline or captured only a small share of the oracle's 7.3 percent potential.
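
To make "share of the oracle's potential" concrete, the reported gains can be divided by the 7.3 percent headroom. This is back-of-the-envelope arithmetic that treats the quoted percentages as directly comparable, which is an assumption: the oracle figure is an "up to" number and may be measured against a different baseline than the reported gains. The helper `fraction_of_oracle` is hypothetical.

```python
ORACLE_HEADROOM = 7.3  # percent, the "up to" oracle gain quoted in the text


def fraction_of_oracle(gain: float, headroom: float = ORACLE_HEADROOM) -> float:
    """Share of the oracle headroom that a reported gain would represent."""
    return gain / headroom


for gain in (3.0, 3.8, 4.4):
    print(gain, round(fraction_of_oracle(gain), 2))
```

With these inputs the shares come out at roughly 0.41, 0.52, and 0.6, so the identical-parameter results would correspond to about half of the quoted headroom.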

Figures

Figures reproduced from arXiv: 2511.08577 by Guohao Dai, Huazhong Yang, Tianyu Fu, Yichen You, Yu Wang, Zekai Chen.

**Figure 2.** TaH Overview. (a) Regular causal attention: tokens attend only to previous positions. (b) …

**Figure 3.** Training dynamics of the LLM backbone on Qwen3-0.6B-Base. TaH converges rapidly and achieves lower perplexity. [Adjacent plot: x-axis Continuation Threshold, y-axis Accuracy; series TaH, Standard, AlwaysThink.]

**Figure 5.** Next-token prediction changes across iterations. Top2 tokens that think-twice most are visualized.

**Figure 6.** Iteration-decider accuracy vs. epoch (Qwen3-0.6B).

**Figure 7.** TaH duo-causal attention pattern.

Original abstract

Improving reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. In this work, we ask whether selectively skipping latent iterations may improve accuracy. We reveal significant potential with an oracle iteration policy that boosts model performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration only at tokens that are likely incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the LLM's objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider modules, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TaH adds a learned decider plus depth-aware LoRA and duo-causal attention to looped transformers, delivering 3.0-4.4% gains over baselines at identical parameter counts while skipping iterations on 93% of tokens, but the decider's ability to avoid new errors remains the load-bearing question.

The key takeaway is that this paper shows a way to apply latent iterations selectively in looped transformers using a learned decider, which improves reasoning performance while spending extra compute on only a small share of tokens. The authors build on looped transformers by adding a lightweight neural network that decides, per token, whether to run extra iterations after the first pass. To support this, they use depth-aware LoRA modules that adapt the model specifically for refining hard tokens, and a duo-causal attention mechanism that spans both the sequence and the iteration depth. This setup allows parallel computation while sharing information across iterations.

The results indicate gains of 3.8 to 4.4 percent over always-iterate baselines with the same number of parameters, while skipping iterations on 93 percent of tokens. With under 3 percent more parameters for the LoRA and decider modules, the improvements rise to 5.3 to 6.2 percent over always-iterate baselines and 6.1 to 6.8 percent over the single-pass Qwen3 models. An oracle policy shows even more room, up to 7.3 percent.

What the paper does well is identify the overthinking issue, where extra iterations sometimes hurt correct predictions, and then demonstrate that selective application can capture part of the potential improvement across math, question answering, and coding tasks. Releasing the code is a plus for reproducibility.

The soft spots are around the decider itself. Since it is lightweight and trained on first-pass results, it needs to correctly identify which tokens would improve with iteration and avoid triggering iterations on ones that are already right or would get worse. The fact that the method realizes only about half the oracle gain suggests there might be some misclassifications, but the paper would need to show error rates or ablations to confirm the decider isn't introducing new problems. Details on training the decider, the choice of thresholds, and whether the gains hold with different random seeds or stronger baselines would also help.

This work is mainly for people building or studying efficient reasoning models under compute limits. Someone looking for incremental architecture tweaks to looped methods would find it useful. It has enough new elements and experimental support to warrant a serious referee, though the review would likely focus on verifying the decider's reliability and the fairness of comparisons. I would send this to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Think-at-Hard (TaH), a looped transformer variant that performs selective latent iterations to refine token predictions in reasoning LLMs. It identifies a latent overthinking issue where additional iterations can revise correct first-pass predictions into errors. TaH trains a lightweight neural decider on first-pass outputs to trigger iterations only for likely-incorrect tokens, augments the model with depth-aware LoRA modules for hard-token refinement, and uses duo-causal attention to enable cross-iteration information flow while preserving parallelism. An oracle policy is shown to yield up to 7.3% gains; the proposed method reports 3.8-4.4% improvement over always-iterate baselines (identical parameter count) and 3.0-3.8% over single-iteration Qwen3 baselines across nine math/QA/coding benchmarks, while skipping iterations on 93% of tokens. Allowing under 3% additional parameters for the LoRA and decider modules raises these gains to 5.3-6.2% and 6.1-6.8%, respectively. Code is released.

Significance. If the empirical results hold under scrutiny, the work offers a practical route to higher reasoning accuracy in parameter-constrained LLMs by avoiding both unnecessary computation and harmful overthinking. The oracle upper bound provides a clear target, the 93% skip rate demonstrates efficiency, and the consistent gains across task types are noteworthy. Releasing code is a clear strength that aids verification. The approach sits at the intersection of iterative refinement and adaptive computation, with potential to influence future designs of looped or recurrent-style transformers.

major comments (2)

[Method and Experiments sections] The central performance claims rest on the lightweight neural decider reliably identifying tokens that benefit from iteration without introducing revision-to-error on already-correct tokens. However, the manuscript provides insufficient detail on decider training labels (ground-truth vs. self-supervised), decision threshold selection, and quantitative error analysis (e.g., false-positive rate on correct tokens or ablation removing the decider). This is load-bearing because the gap between the 7.3% oracle and the realized 3.0-4.4% gains could be explained by decider misclassifications, directly affecting whether selective iteration improves or degrades the always-iterate baseline.
[Experiments] Table or figure reporting main results: the 3.8-4.4% gains over always-iterate baselines and 3.0-3.8% over single-pass Qwen3 are presented without reported standard deviations across multiple seeds, number of evaluation runs, or statistical significance tests. Given known sensitivity of LLM benchmark scores to implementation details and prompt formatting, these omissions make it difficult to judge whether the reported margins are robust.
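
The error analysis requested in the first major comment could be reported with two simple rates. The sketch below is a hypothetical illustration on made-up toy labels, not data from the paper: it counts how often the decider fires on tokens that were already correct (false-positive rate) and how many first-pass errors it catches (recall on hard tokens). The function name `decider_error_rates` is an assumption.

```python
from typing import Dict, List


def decider_error_rates(
    first_pass_correct: List[bool], decider_iterates: List[bool]
) -> Dict[str, float]:
    """Summarize decider mistakes against first-pass correctness labels."""
    pairs = list(zip(first_pass_correct, decider_iterates))
    fp = sum(c and d for c, d in pairs)              # fired on a correct token
    tn = sum(c and not d for c, d in pairs)          # left a correct token alone
    tp = sum((not c) and d for c, d in pairs)        # fired on a wrong token
    fn = sum((not c) and (not d) for c, d in pairs)  # missed a wrong token
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "hard_token_recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels: four tokens right after pass one, two wrong.
correct = [True, True, True, True, False, False]
fires = [False, True, False, False, True, False]
rates = decider_error_rates(correct, fires)
print(rates)  # {'false_positive_rate': 0.25, 'hard_token_recall': 0.5}
```

A high false-positive rate would expose correct tokens to the overthinking risk, while low recall would leave first-pass errors unrefined. Both would widen the gap to the oracle.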

minor comments (2)

[Abstract] The abstract states gains on 'nine benchmarks' but does not name them; adding the list (e.g., GSM8K, MATH, HumanEval, etc.) would improve immediate clarity for readers.
[Method] Notation for the duo-causal attention and depth-aware LoRA could be clarified with a small diagram or explicit equations showing how the iteration dimension is integrated into the attention mask.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes we will incorporate in the revised manuscript.

Referee: [Method and Experiments sections] The central performance claims rest on the lightweight neural decider reliably identifying tokens that benefit from iteration without introducing revision-to-error on already-correct tokens. However, the manuscript provides insufficient detail on decider training labels (ground-truth vs. self-supervised), decision threshold selection, and quantitative error analysis (e.g., false-positive rate on correct tokens or ablation removing the decider). This is load-bearing because the gap between the 7.3% oracle and the realized 3.0-4.4% gains could be explained by decider misclassifications, directly affecting whether selective iteration improves or degrades the always-iterate baseline.

Authors: We agree that the original description of the neural decider was insufficiently detailed. We will revise the Method section to explicitly describe the self-supervised label generation process (comparing first-pass token predictions against whether later iterations correct or introduce errors), the threshold selection procedure performed on a validation split, and new quantitative error analysis. This will include the false-positive rate on tokens that were already correct after the first pass as well as an ablation that removes the decider. These additions will clarify how the realized gains relate to the oracle upper bound and confirm that selective iteration does not degrade the always-iterate baseline.
Revision: yes

Referee: [Experiments] Table or figure reporting main results: the 3.8-4.4% gains over always-iterate baselines and 3.0-3.8% over single-pass Qwen3 are presented without reported standard deviations across multiple seeds, number of evaluation runs, or statistical significance tests. Given known sensitivity of LLM benchmark scores to implementation details and prompt formatting, these omissions make it difficult to judge whether the reported margins are robust.

Authors: We acknowledge the value of reporting variability and statistical tests for LLM benchmark results. Although the primary numbers were obtained from single evaluation runs per configuration, we will update the Experiments section and main results table to report standard deviations over at least three random seeds, state the number of evaluation runs performed, and add statistical significance tests (e.g., McNemar's test) for the accuracy differences. These revisions will strengthen the evidence for the robustness of the observed gains.
Revision: yes
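
For readers unfamiliar with the test named in the rebuttal, here is a minimal sketch of the exact (binomial) form of McNemar's test for two models scored on the same items. The per-item correctness lists are made-up toy data, and the function name `mcnemar_exact` is hypothetical. Only discordant items, where exactly one model is right, enter the statistic.

```python
from math import comb
from typing import List, Tuple


def mcnemar_exact(
    correct_a: List[bool], correct_b: List[bool]
) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired per-item correctness."""
    b = sum(a and not o for a, o in zip(correct_a, correct_b))      # only A right
    c = sum((not a) and o for a, o in zip(correct_a, correct_b))    # only B right
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Binomial tail under the null that discordant items split 50/50.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: model A alone is right on 8 items, model B alone on 2, both on 10.
model_a = [True] * 8 + [False] * 2 + [True] * 10
model_b = [False] * 8 + [True] * 2 + [True] * 10
b, c, p_value = mcnemar_exact(model_a, model_b)
print(b, c, p_value)  # 8 2 0.109375
```

In this toy case an 8-to-2 split among discordant items does not reach the conventional 0.05 level, which illustrates why paired tests and multiple seeds matter when margins are a few points.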

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

The paper is an empirical proposal for Think-at-Hard (TaH), a selective-iteration looped transformer. All performance claims (3.8-4.4% gains over always-iterate baselines, 3.0-3.8% over single-pass Qwen3, oracle headroom of 7.3%) are measured directly on nine benchmarks against explicit baselines. The lightweight neural decider, depth-aware LoRA, and duo-causal attention are introduced as architectural components whose effectiveness is assessed via ablation and comparison experiments rather than derived from any self-referential equation or fitted quantity. No derivation chain, uniqueness theorem, or ansatz is invoked that reduces the reported results to quantities defined inside the paper itself. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that first-pass token predictions contain sufficient signal for a lightweight decider to select beneficial iterations and that depth-aware LoRA can shift the objective without destabilizing the base model. No explicit free parameters beyond standard training hyperparameters are described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "TaH employs a lightweight neural decider to trigger latent iteration only at tokens that are likely incorrect after the standard forward pass... duo-causal attention... depth-aware Low-Rank Adaptation (LoRA) modules"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "oracle iteration policy that boosts model performance by up to 7.3%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
cs.LG · 2026-04 · unverdicted · novelty 7.0
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.

LASAR: Latent Adaptive Semantic Aligned Reasoning for Generative Recommendation
cs.IR · 2026-05 · unverdicted · novelty 6.0
LASAR uses two-stage supervised training plus reinforcement learning to ground semantic IDs, align latent reasoning trajectories to CoT hidden states via KL divergence, and adaptively choose reasoning depth, halving a...

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
cs.CL · 2026-05 · unverdicted · novelty 6.0
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
cs.CL · 2026-05 · unverdicted · novelty 6.0
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
