Recognition: 2 theorem links
Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models (published as a conference paper at ICLR 2026)
Pith reviewed 2026-05-16 10:21 UTC · model grok-4.3
The pith
Certain attention heads amplify wrong reasoning paths and cause LLMs to fail when chains exceed training lengths.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Errors in reasoning hop generalization concentrate at particular token positions due to internal competition among attention heads. A subset of these heads, termed erroneous processing heads, tip the balance toward incorrect reasoning paths. Selectively removing these heads at inference time restores correct predictions, and a dynamic test-time correction procedure that identifies and deactivates them improves generalization without retraining.
What carries the argument
Erroneous processing heads (ep heads): specific attention heads that amplify incorrect reasoning trajectories while suppressing correct ones during multi-hop inference.
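The knockout intervention described here can be illustrated with a toy model (not the paper's architecture): output logits are a sum of per-head contributions, and deactivating a head zeroes its term. All numbers below are made up for illustration; the point is only that removing one dominant "ep head" can flip the predicted token.

```python
import math

def logits(head_contribs, knockout=None):
    """Sum per-head logit contributions, optionally zeroing one head."""
    n_tokens = len(head_contribs[0])
    out = [0.0] * n_tokens
    for h, contrib in enumerate(head_contribs):
        if h == knockout:
            continue  # ablated head contributes nothing
        out = [o + c for o, c in zip(out, contrib)]
    return out

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Hypothetical contributions over a 3-token vocabulary; head 2 plays the role
# of an "ep head" that amplifies the wrong token (index 1).
heads = [
    [2.0, 0.5, 0.1],   # head 0: favors the correct token (index 0)
    [1.0, 0.8, 0.2],   # head 1: mildly favors the correct token
    [0.0, 4.0, 0.0],   # head 2: ep head pushing toward the wrong token
]

full = softmax(logits(heads))
ablated = softmax(logits(heads, knockout=2))
print(full.index(max(full)))        # wrong token wins with all heads active
print(ablated.index(max(ablated)))  # correct token restored after knockout
```

In a real transformer the analogous operation would zero one head's output projection at inference time; this sketch only captures the additive competition the paper describes.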
If this is right
- Errors concentrate at token positions of a few critical types rather than spreading uniformly across the sequence.
- Individual removal of erroneous processing heads during inference often restores correct predictions.
- Dynamic identification and deactivation of these heads at test time improves hop generalization across multiple domains and models.
- The intervention requires no retraining or changes to model architecture.
Where Pith is reading between the lines
- Hop generalization failures may stem from specific internal mechanisms rather than a broad limit on the model's ability to handle longer chains.
- Similar head-level interventions could apply to other out-of-distribution scenarios in language models.
- Training procedures might be adjusted to reduce specialization of heads on particular hop lengths.
- These heads may play roles in other tasks, so selective deactivation requires care to avoid side effects.
Load-bearing premise
The identified erroneous processing heads are causally responsible for the generalization failures and can be accurately detected and deactivated at test time without harming other capabilities.
What would settle it
An experiment showing that deactivating the identified heads fails to restore correct predictions on out-of-distribution hops or reduces accuracy on standard in-distribution reasoning tasks would falsify the claim.
Original abstract
Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM failures in chain-of-thought reasoning hop generalization arise because errors concentrate at specific token positions due to a small set of erroneous processing heads (ep heads) that amplify incorrect trajectories while suppressing correct ones. It shows that removing individual ep heads at inference often restores correct predictions and proposes a lightweight test-time correction method that dynamically identifies and deactivates these heads, reporting consistent gains across multiple tasks and models.
Significance. If the causal role of the identified ep heads holds after proper controls, the work supplies a mechanistic account of a well-documented limitation in CoT reasoning and a practical, training-free intervention that could improve out-of-distribution hop performance. The systematic experiments across tasks and models constitute a strength, though the absence of isolating controls limits the strength of the causal interpretation.
major comments (2)
- [Experimental results] Experimental results (as summarized in the abstract): the central claim that ep-head deactivation specifically corrects hop-generalization failures is not isolated from generic intervention effects. No ablation results are reported for an equal number of randomly selected heads or non-ep heads, nor are in-distribution task degradations or unrelated capability impacts quantified, leaving the causality correlational rather than demonstrated.
- [Method] Head-identification procedure (abstract and method description): the criteria used to designate ep heads, including any thresholds, token-position error concentration metrics, or statistical tests, are not specified in sufficient detail. This detail is load-bearing for both the mechanistic interpretation and the reproducibility of the proposed test-time correction.
minor comments (1)
- [Abstract] The abstract introduces 'reasoning hop generalization' without a concise definition; adding one sentence would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that stronger controls and more precise methodological details are needed to support the causal claims and ensure reproducibility. We will revise the manuscript accordingly.
Point-by-point responses
Referee: [Experimental results] Experimental results (as summarized in the abstract): the central claim that ep-head deactivation specifically corrects hop-generalization failures is not isolated from generic intervention effects. No ablation results are reported for an equal number of randomly selected heads or non-ep heads, nor are in-distribution task degradations or unrelated capability impacts quantified, leaving the causality correlational rather than demonstrated.
Authors: We acknowledge that the current experiments leave the specificity of ep-head deactivation correlational. In the revised version we will add controlled ablations that deactivate an equal number of randomly chosen heads and non-ep heads, and we will report performance on in-distribution versions of the tasks plus unrelated capabilities to quantify any generic degradation. These results will be presented in a new subsection of the experiments.
Revision: yes
Referee: [Method] Head-identification procedure (abstract and method description): the criteria used to designate ep heads, including any thresholds, token-position error concentration metrics, or statistical tests, are not specified in sufficient detail. This detail is load-bearing for both the mechanistic interpretation and the reproducibility of the proposed test-time correction.
Authors: We agree that the identification criteria must be stated explicitly. The revision will expand the Methods section to define the precise token-position error concentration metric, the numerical thresholds applied to select ep heads, and the statistical tests used to validate their significance. These details will also be accompanied by pseudocode for the dynamic deactivation procedure.
Revision: yes
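One plausible shape for the promised identification criterion: knock out each head in turn and keep those whose ablation raises the ground-truth token's probability by more than a threshold. The additive toy model and the 0.2 threshold below are illustrative assumptions, not the paper's actual procedure.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def prob_of(head_contribs, token, knockout=None):
    """Probability of `token` under summed head contributions, with optional ablation."""
    n = len(head_contribs[0])
    out = [0.0] * n
    for h, contrib in enumerate(head_contribs):
        if h != knockout:
            out = [o + c for o, c in zip(out, contrib)]
    return softmax(out)[token]

def find_ep_heads(head_contribs, gt_token, threshold=0.2):
    """Heads whose removal raises the ground-truth probability by more than `threshold`."""
    base = prob_of(head_contribs, gt_token)
    return [h for h in range(len(head_contribs))
            if prob_of(head_contribs, gt_token, knockout=h) - base > threshold]

# Illustrative per-head logit contributions; head 2 is the ep-head candidate.
heads = [
    [2.0, 0.5, 0.1],
    [1.0, 0.8, 0.2],
    [0.0, 4.0, 0.0],
]
print(find_ep_heads(heads, gt_token=0))  # → [2]
```

A real implementation would run this scan over attention heads at the critical token positions; the threshold and metric are exactly the details the referee asks the authors to pin down.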
Circularity Check
No circularity: empirical intervention study on LLM attention heads
full rationale
The paper performs systematic empirical analysis of token-level errors in chain-of-thought reasoning across multiple domains and models. It identifies patterns in attention heads through direct observation and ablation experiments, then proposes a test-time deactivation method. No mathematical derivations, equations, or parameter-fitting steps are described that would reduce any claimed prediction or result to its own inputs by construction. The central claims rest on experimental outcomes rather than self-definitional mappings, fitted inputs renamed as predictions, or load-bearing self-citations. This is a standard non-circular empirical investigation.
Axiom & Free-Parameter Ledger
free parameters (1)
- ep-head identification threshold or criterion
axioms (1)
- domain assumption: Individual attention heads can be deactivated at inference time without collapsing overall model capability
invented entities (1)
- erroneous processing heads (ep heads): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel), tagged unclear: the relation between the paper passage and the cited Recognition theorem is unclear. Passage: "certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery theorem), tagged unclear: the relation between the paper passage and the cited Recognition theorem is unclear. Passage: "we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
- [2] A Mathematical Framework for Transformer Circuits. https://transformer-circuits.pub/2021/framework/index.html, 2021.
- [3] nostalgebraist. Interpreting GPT: the logit lens. LessWrong, 2020.
- [4] Changnan Xiao and Bing Liu. Generalizing Reasoning Problems to Longer Lengths. ICLR 2025.
- [5] Xinhao Yao, Ruifeng Ren, Yun Liao, and Yong Liu. Unveiling the Mechanisms of Explicit CoT Training: How Chain-of-Thought Enhances Reasoning Generalization. arXiv:2502.04667, 2025.
- [6] Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, and Hao Wang. TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning. arXiv:2505.11737, 2025.
- [7] Appendix intervention results: effects of knocking out an erroneous processing head in the Qwen2.5-7B-Instruct model on predictions at the token positions of Parity-NL error type 2, shown for 10 originally erroneously predicted samples before and after the intervention.
- [8] Appendix F.4 (additional results about locating and analyzing processing heads; Figures 14, 20, and 21): locating results for ep heads and the effects of knocking out individual heads on correcting erroneous predictions (i.e., the probabilities of predicting ground-truth tokens) with four models and all seve…
- [9] Parity-NL example (Figure 26): The coin starts heads up. … Ethan flips the coin. (Coin becomes heads up.) After going through each step, we see that the coin ends up heads up after the last flip by Ethan. Therefore, the coin is heads up.
- Multi-Digit Multiplication (MDM) example: 326 * 3589 = ? Please think step-by-step. Answer: Let's break down the multiplica…
- Last-Letter Concatenation (LLC) input words: garden sound valid potato numb write tiger truth sound hotel
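The Parity-NL task above (Figure 26) is a parity computation dressed in natural language: the coin's final face depends only on the parity of the number of actual flips. A minimal sketch, with a made-up flip sequence since the full problem text was truncated in extraction:

```python
def coin_state(starts_heads_up, flips):
    """Each actual flip toggles the face; the parity of the flip count decides the end state."""
    heads_up = starts_heads_up
    for flipped in flips:
        if flipped:            # a person who flips toggles the coin
            heads_up = not heads_up
    return heads_up

# An even number of actual flips returns the coin to its starting face.
print(coin_state(True, [True, False, True, True, True]))  # 4 flips → True (heads up)
```

Hop generalization stresses exactly this structure: the same toggle rule must be applied over more steps than seen in training.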
- [20]-[29] LLC walk-through (Figure 28: an example of the LLC problem):
The last letter of ‘garden’ is ‘n’. The current concatenating result is ‘n’.
The last letter of ‘sound’ is ‘d’. The current concatenating result is ‘nd’.
The last letter of ‘valid’ is ‘d’. The current concatenating result is ‘ndd’.
The last letter of ‘potato’ is ‘o’. The current concatenating result is ‘nddo’.
The last letter of ‘numb’ is ‘b’. The current concatenating result is ‘nddob’.
The last letter of ‘write’ is ‘e’. The current concatenating result is ‘nddobe’.
The last letter of ‘tiger’ is ‘r’. The current concatenating result is ‘nddober’.
The last letter of ‘truth’ is ‘h’. The current concatenating result is ‘nddoberh’.
The last letter of ‘sound’ is ‘d’. The current concatenating result is ‘nddoberhd’.
The last letter of ‘hotel’ is ‘l’. The current concatenating result is ‘nddoberhdl’. Therefore, the answer is ‘nddoberhdl’.
- [30]-[38] Multi-Operand Addition and Subtraction (MOAS) example (Figure 29: an example of the MOAS problem). Problem: 17 + 48 - 25 + 99 - 4 - 85 + 19 + 68 - 31 - 88 = ? Please think step-by-step. Answer: Certainly! Let us solve the equation step by step:
Add 48: 17 + 48 = 65
Subtract 25: 65 - 25 = 40
Add 99: 40 + 99 = 139
Subtract 4: 139 - 4 = 135
Subtract 85: 135 - 85 = 50
Add 19: 50 + 19 = 69
Add 68: 69 + 68 = 137
Subtract 31: 137 - 31 = 106
Subtract 88: 106 - 88 = 18
So, the final answer is 18.
- [39] Crawler Log Folder (CLF) example setup (continued in Figure 30). Python code: def min_operations(logs): folder_depth = 0; for operation in logs: if operation == '../': folder_depth = max(0, folder_depth - 1) elif operation != './': folder_dept…
Input: logs = ['d1/', 'd2/', '../', 'd2/', 'd3/', './', '../', 'd3/', 'd4/', './']
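The CLF helper above is truncated by extraction; here is a hedged runnable reconstruction, with the final branch inferred from the worked steps (each entry other than './' and '../' descends one level):

```python
def min_operations(logs):
    """Reconstruction of the CLF helper from the appendix: '../' moves up one
    folder (never below the root), './' stays put, and any other entry
    descends one level."""
    folder_depth = 0
    for operation in logs:
        if operation == '../':
            folder_depth = max(0, folder_depth - 1)
        elif operation != './':
            folder_depth = folder_depth + 1
    return folder_depth

logs = ['d1/', 'd2/', '../', 'd2/', 'd3/', './', '../', 'd3/', 'd4/', './']
print(min_operations(logs))  # → 4, matching the worked example
```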
- [40]-[49] CLF walk-through (Figure 30: an example of the CLF problem):
Operation ‘d1/’: folder depth = folder depth + 1 = 0 + 1 = 1
Operation ‘d2/’: folder depth = folder depth + 1 = 1 + 1 = 2
…
Operation ‘../’: folder depth = max(0, folder depth - 1) = max(0, 2) = 2
Operation ‘d3/’: folder depth = folder depth + 1 = 2 + 1 = 3
Operation ‘d4/’: folder depth = folder depth + 1 = 3 + 1 = 4
Operation ‘./’: folder depth = 4 (no change)
Therefore, the final return value (folder depth) of the min_operations() function with the input logs is 4. So the answer is 4.
- [50]-[60] Object Counting (ObjC) example (Figure 31: an example of the ObjC problem):
Sentence: Alessia got 4 bananas. Current total number: 0 + 4 = 4
Sentence: Gregory picked up 6 watermelons. Current total number: 4 + 6 = 10
Sentence: Lila acquired 4 lemons. Current total number: 10 + 4 = 14
Sentence: Lila bought 3 pineapples. Current total number: 14 + 3 = 17
Sentence: Alan obtained 3 strawberries. Current total number: 17 + 3 = 20
Sentence: Alessia obtained 7 lemons. Current total number: 20 + 7 = 27
Sentence: Alan picked up 9 watermelons. Current total number: 27 + 9 = 36
Sentence: Gregory picked up 6 apples. Current total number: 36 + 6 = 42
Sentence: Alessia picked up 2 oranges. Current total number: 42 + 2 = 44
Sentence: Alan acquired 8 apples. Current total number: 44 + 8 = 52
Therefore, the total number of fruit items mentioned is 52.
- [61] Number Student (NumS) example setup (continued in Figure 32). Python code: def solution(startTime: List[int], endTime: List[int], queryTime: int): count = 0; for …
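The ObjC walk-through above (Figure 31) is a running sum over the quantities in each sentence. A minimal sketch; the regex-based number extraction is an illustrative simplification, not the paper's prompt format:

```python
import re

def count_objects(sentences):
    """Sum every integer mentioned across the sentences (a running total, as in the figure)."""
    total = 0
    for sentence in sentences:
        for number in re.findall(r'\d+', sentence):
            total += int(number)
    return total

sentences = [
    "Alessia got 4 bananas.", "Gregory picked up 6 watermelons.",
    "Lila acquired 4 lemons.", "Lila bought 3 pineapples.",
    "Alan obtained 3 strawberries.", "Alessia obtained 7 lemons.",
    "Alan picked up 9 watermelons.", "Gregory picked up 6 apples.",
    "Alessia picked up 2 oranges.", "Alan acquired 8 apples.",
]
print(count_objects(sentences))  # → 52, as in the figure
```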
- [61]-[70] NumS walk-through (Figure 32: an example of the NumS problem):
startTime[0] = 1, endTime[0] = 4. Range [1, 4]: 4 is within this range. count: 0 + 1 = 1
startTime[1] = 2, endTime[1] = 8. Range [2, 8]: 4 is within this range. count: 1 + 1 = 2
startTime[2] = 3, endTime[2] = 5. Range [3, 5]: 4 is within this range. count: 2 + 1 = 3
startTime[3] = 4, endTime[3] = 5. Range [4, 5]: 4 is within this range. count: 3 + 1 = 4
startTime[4] = 6, endTime[4] = 9. Range [6, 9]: 4 is not within this range. count: 4
startTime[5] = 2, endTime[5] = 5. Range [2, 5]: 4 is within this range. count: 4 + 1 = 5
startTime[6] = 5, endTime[6] = 7. Range [5, 7]: 4 is not within this range. count: 5
startTime[7] = 1, endTime[7] = 3. Range [1, 3]: 4 is not within this range. count: 5
startTime[8] = 2, endTime[8] = 4. Range [2, 4]: 4 is within this range. count: 5 + 1 = 6
startTime[9] = 4, endTime[9] = 9. Range [4, 9]: 4 is within this range. count: 6 + 1 = 7
After checking all events, the function returns the final count, which is 7.
- Possible error types in an LLC example (setup for Figure 33). Problem: Take the last letters of the words in “garden sound valid potato numb write tiger truth sou…
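The NumS helper is truncated in the setup above; a hedged reconstruction, with the loop body inferred from the walk-through:

```python
from typing import List

def solution(startTime: List[int], endTime: List[int], queryTime: int) -> int:
    """Count students whose [start, end] interval contains queryTime."""
    count = 0
    for start, end in zip(startTime, endTime):
        if start <= queryTime <= end:
            count += 1
    return count

startTime = [1, 2, 3, 4, 6, 2, 5, 1, 2, 4]
endTime   = [4, 8, 5, 5, 9, 5, 7, 3, 4, 9]
print(solution(startTime, endTime, queryTime=4))  # → 7, as in the figure
```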
- [71] Type 1, Word Recall Error (‘garden’→‘valid’): The last letter of ‘garden’ is ‘n’. The current concatenating result is ‘n’.
- [72] Type 2, Letter Error (‘d’→‘n’): The last letter of ‘sound’ is ‘d’. The current concatenating result is ‘nd’.
- [73] Type 3, Concatenation Error (‘ndd’→‘dnd’): The last letter of ‘valid’ is ‘d’. The current concatenating result is ‘ndd’. (Intermediate six steps omitted.)
- [74] Type 4/5, Reasoning Less/More Steps: The last letter of ‘hotel’ is ‘l’. The current concatenating result is ‘nddoberhdl’. Therefore, the answer is ‘nddoberhdl’.
Figure 33: Possible error types in an LLC problem. The underlined green parts highlight the positions where errors of different types may occur; the bold blue parts indicate the error types. For each error type, we only mark one instance.
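The annotated LLC procedure reduces to last-letter concatenation; a minimal sketch of the correct computation on the Figure 28 input (the error taxonomy above marks where a wrong word, wrong letter, or wrong running string would enter):

```python
def last_letter_concat(words):
    """Concatenate the last letter of each word, left to right."""
    result = ''
    for word in words:
        result += word[-1]   # a "letter error" would pick the wrong character here
    return result

words = "garden sound valid potato numb write tiger truth sound hotel".split()
print(last_letter_concat(words))  # → 'nddoberhdl', as in Figure 28
```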
- [75] Type 1, Operand Recall Error (17→48): Start with 17.
- [76] Type 2, Operation Recall Error (Add 48→Subtract 31): Add 48: 17 + 48 = 65.
- [77] Type 3, Local Result Copy Error (65→35) and Type 4, Operand Interpretation Error (-→+): Subtract 25: 65 - 25 = 40.
- [78] Type 5, Local Calculation Error (139→129): Add 99: 40 + 99 = 139. (Intermediate five steps omitted.)
- [79] Type 6/7, Reasoning Less/More Steps: Subtract 88: 106 - 88 = 18. So, the final answer is 18.
Figure 34: Possible error types in a MOAS problem. The underlined green parts highlight the positions where errors of different types may occur; the bold blue parts indicate the error types. For each error type, we only mark one instance.
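The MOAS task is strict left-to-right evaluation of an alternating sum; a minimal sketch of the correct computation on the example above:

```python
def evaluate(operands, signs):
    """signs[i] is +1 or -1 for operands[i+1]; evaluation is strictly left to right."""
    total = operands[0]
    for sign, operand in zip(signs, operands[1:]):
        total += sign * operand
    return total

operands = [17, 48, 25, 99, 4, 85, 19, 68, 31, 88]
signs    = [+1, -1, +1, -1, -1, +1, +1, -1, -1]
print(evaluate(operands, signs))  # → 18, as in the figure
```

Each intermediate total (65, 40, 139, 135, 50, 69, 137, 106, 18) is a token position where the paper's error types can strike.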
- [80] Initial state: folder depth = 0.
- [81] Type 1, Operation Recall Error (‘d1/’→‘d2/’): Operation ‘d1/’: folder depth = folder depth + 1 = 0 + 1 = 1.
- [82] Type 2, Depth Copy Error (1→0): Operation ‘d2/’: folder depth = folder depth + 1 = 1 + 1 = 2.
- [83] Operation ‘../’: folder depth = max(0, folder depth - 1) = max(0, 1) = 1.
- [84] Type 3, Local Calculation Error (2→1): Operation ‘d2/’: folder depth = folder depth + 1 = 1 + 1 = 2. (Intermediate five steps omitted.)
- [85] Type 4/5, Reasoning Less/More Steps, and Type 6, Final Depth Copy Error (4→3): Operation ‘./’: folder depth = 4 (no change). Therefore, the final return value (folder depth) for the min_operations() function with the input logs is 4. So the answer is 4.
Figure 35: Possible error types in a CLF problem. The underlined green parts highlight the positions where errors of different types may occur; the bold blue parts indicate the error types. For each error type, we only mark one instance.