Recognition: unknown
Reasoning Models Know What's Important, and Encode It in Their Activations
Pith reviewed 2026-05-10 05:22 UTC · model grok-4.3
The pith
Language models encode an internal representation of reasoning step importance in their activations prior to generating subsequent steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models encode an internal representation of step importance in their activations even prior to the generation of subsequent steps. This representation generalizes across models, is distributed across layers, and does not correlate with surface-level features such as a step's relative position or length. Probes on activations therefore identify important steps more reliably than analysis of the tokens themselves.
What carries the argument
Linear probes trained on model activations to predict step importance, where importance is defined by whether removing the step changes the final answer.
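The review does not reproduce the paper's exact configuration, but the shape of the method is clear. Below is a minimal sketch of such a probe, assuming per-step activation vectors (e.g. residual-stream states averaged over a step's tokens at one layer) and binary removability labels are already available; the names and hyperparameters are illustrative, not the paper's code.

```python
# Minimal sketch of an importance probe. Assumes each reasoning step has been
# reduced to a single activation vector and paired with a binary removability
# label (1 = deleting the step changed the final answer). Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def train_importance_probe(step_activations: np.ndarray,
                           removability_labels: np.ndarray,
                           seed: int = 0):
    """step_activations: (n_steps, d_model); removability_labels: (n_steps,)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        step_activations, removability_labels, test_size=0.2, random_state=seed)
    probe = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    probe.fit(X_tr, y_tr)
    return probe, f1_score(y_te, probe.predict(X_te))
```

The simulated rebuttal below describes the paper's own choice in similar terms (L2-regularized logistic probes with cross-validated splits), so this should be read as the generic pattern rather than the exact setup.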
If this is right
- Activations contain more information about step importance than the generated tokens.
- The importance signal is available before subsequent steps are produced.
- The representation is shared across different models rather than idiosyncratic to one architecture.
- Importance information is spread across multiple layers instead of localized.
- The encoding is independent of surface statistics such as position or length.
Where Pith is reading between the lines
- Intervening directly on the activations that encode importance could alter or shorten reasoning chains without changing the prompt.
- Activation-based importance detection might be combined with existing pruning techniques to reduce compute on long chains; a minimal sketch of how probe scores could drive such pruning follows this list.
- The finding suggests that complete interpretability of reasoning requires looking inside the model rather than stopping at output inspection.
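To make the pruning idea concrete, a probe score could gate which steps are retained before generation continues. This is speculation rendered as code, not an experiment from the paper; the threshold and all names are assumptions.

```python
# Illustrative only: keep reasoning steps whose probe score clears a threshold.
# `probe` is a trained importance probe (see the sketch above); the 0.2 cutoff
# is an arbitrary assumption, not a value reported in the paper.
def prune_chain(steps, step_activations, probe, threshold=0.2):
    keep_prob = probe.predict_proba(step_activations)[:, 1]  # P(step matters)
    return [step for step, p in zip(steps, keep_prob) if p >= threshold]
```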
Load-bearing premise
The operational definition of importance via removability accurately reflects the causal role of the step inside the model's reasoning rather than some correlated artifact.
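That operational definition reduces to a delete-and-rerun loop. A hedged sketch of the labeling procedure, where `generate_answer` and `answers_match` are hypothetical stand-ins for whatever inference and answer-comparison pipeline is actually used:

```python
# Hedged sketch of removability labeling: delete one step, re-run the model on
# the shortened chain, and record whether the final answer changes.
# `generate_answer` and `answers_match` are hypothetical helpers.
def label_step_importance(prompt, steps, generate_answer, answers_match):
    baseline_answer = generate_answer(prompt, steps)
    labels = []
    for i in range(len(steps)):
        ablated_chain = steps[:i] + steps[i + 1:]            # drop step i
        ablated_answer = generate_answer(prompt, ablated_chain)
        # A step counts as important (non-removable) if its removal flips the answer.
        labels.append(not answers_match(baseline_answer, ablated_answer))
    return labels
```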
What would settle it
If probes trained on activations do not outperform token-based classifiers at predicting which steps, when removed, change the model's final answer, the core claim fails.
read the original abstract
Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. This internal representation of importance generalizes across models, is distributed across layers, and does not correlate with surface-level features, such as a step's relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that language models encode an internal representation of reasoning-step importance in their activations. Importance is operationalized via post-hoc removability (whether deleting a step changes the final answer). Probes trained on activations predict this label more accurately than token-based baselines; the signal appears in activations prior to generation of later steps, generalizes across models, is distributed over layers, and is uncorrelated with surface features such as relative position or step length.
Significance. If substantiated, the result would indicate that activation-based probing can recover planning-like information during chain-of-thought generation that is invisible from surface tokens alone. The pre-generation timing, cross-model generalization, and reported independence from length/position are potentially valuable contributions to mechanistic interpretability of reasoning, provided the correlational evidence is strengthened with causal tests.
major comments (3)
- [Abstract and §3 (Methods)] The abstract and methods provide no details on probe architecture, training data, regularization, or controls for confounds. Without these specifics it is impossible to determine whether the reported superiority of activations over tokens reflects genuine encoding or experimental artifacts.
- [§4 (Results)] The central claim that activations contain an 'internal representation of step importance' rests on probes predicting post-hoc removability labels. This is correlational evidence only; the manuscript contains no activation interventions, patching, or causal ablation experiments that would show the model actually consults this representation when deciding what to generate next.
- [§4.2 (Controls and Timing)] The assertion that the signal 'does not correlate with surface-level features' and is present 'prior to the generation of subsequent steps' is consistent with both causal encoding and spurious correlation. Explicit correlation coefficients with position/length and a direct comparison against a position-only baseline are required to support the stronger interpretation.
minor comments (2)
- [Throughout] All result tables and figures should report error bars or confidence intervals and the exact statistical test used for each comparison.
- [§2 (Related Work)] The related-work section should more explicitly situate the probing approach relative to prior studies on CoT faithfulness and internal representations of reasoning.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for clarification and strengthening. We have revised the manuscript to address each point, adding methodological details, explicit controls, and a discussion of the correlational nature of the evidence. Our responses to the major comments are provided below.
read point-by-point responses
Referee: [Abstract and §3 (Methods)] The abstract and methods provide no details on probe architecture, training data, regularization, or controls for confounds. Without these specifics it is impossible to determine whether the reported superiority of activations over tokens reflects genuine encoding or experimental artifacts.
Authors: We agree that the original manuscript lacked sufficient methodological detail. In the revised version, Section 3 has been expanded to specify the probe architecture (linear logistic regression classifiers with L2 regularization), training data construction and splits (80/20 train/test per model with 5-fold cross-validation), hyperparameter tuning via grid search, and additional confound controls including matching on token frequency and step embedding similarity. The abstract has been updated with a concise description of the probing approach. These revisions enable better evaluation of whether the activation advantage reflects genuine internal representations. revision: yes
Referee: [§4 (Results)] The central claim that activations contain an 'internal representation of step importance' rests on probes predicting post-hoc removability labels. This is correlational evidence only; the manuscript contains no activation interventions, patching, or causal ablation experiments that would show the model actually consults this representation when deciding what to generate next.
Authors: We acknowledge that the evidence is correlational: the probes show that step importance is linearly decodable from pre-generation activations and that they outperform token-based baselines. This timing and the distributed nature across layers provide suggestive support for an internal representation, but do not demonstrate causal use by the model. We have added a new paragraph in the Discussion section explicitly noting the correlational limitation and outlining future causal experiments such as activation patching or steering to test whether the model relies on these signals during generation (an illustrative sketch of such a patching test follows these responses). revision: yes
Referee: [§4.2 (Controls and Timing)] The assertion that the signal 'does not correlate with surface-level features' and is present 'prior to the generation of subsequent steps' is consistent with both causal encoding and spurious correlation. Explicit correlation coefficients with position/length and a direct comparison against a position-only baseline are required to support the stronger interpretation.
Authors: We have addressed this by adding explicit Pearson correlation coefficients in revised §4.2, showing near-zero correlations between importance labels and both relative position (r = 0.04) and step length (r = 0.07). We also trained and evaluated a position-only baseline probe (using sinusoidal positional encodings as input) and report that activation-based probes achieve substantially higher accuracy (average +18% F1 across models). These results, now in Table 2 and Figure 3, support that the signal is not explained by surface features alone. revision: yes
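For readers wondering what the deferred causal test could look like, one generic pattern is to overwrite a step's hidden states at a single layer during a forward pass and check whether the answer changes. The sketch below assumes a LLaMA-style HuggingFace module layout; the module path, layer index, and donor activations are all assumptions, and none of this comes from the paper.

```python
# Rough sketch of an activation-patching check, not the paper's method.
# Assumes a LLaMA-style HuggingFace causal LM whose decoder blocks live at
# model.model.layers[i]; donor_acts holds replacement hidden states for the
# token positions covered by the target reasoning step.
import torch

def run_with_patched_step(model, inputs, layer_idx, step_slice, donor_acts):
    layer = model.model.layers[layer_idx]  # assumed module path

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, step_slice, :] = donor_acts  # overwrite the step's activations
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(**inputs)  # compare against an unpatched run
    finally:
        handle.remove()
```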
Circularity Check
No significant circularity; empirical probe study with independent external labels
full rationale
The paper defines importance via an external, post-hoc removability procedure (delete step, re-run, check answer change) that does not reference activations or probe outputs. Probes are trained to predict these independent labels from activations, with reported controls showing the signal is not reducible to position, length, or token baselines. No equations or claims reduce the reported internal representation to a fitted parameter or self-citation by construction. The timing argument (activations before later steps) and generalization results are additional empirical observations, not definitional. This is a standard supervised probing setup against an external benchmark, warranting score 0.
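The controls the rationale leans on reduce to two simple computations: correlating the labels with surface features, and training a probe that sees only those features. A minimal sketch under those assumptions; variable names and the F1 metric are illustrative.

```python
# Hedged sketch of the surface-feature controls described in the rebuttal:
# correlate labels with relative position and step length, then compare an
# activation probe against a position-only baseline. Illustrative names only.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def surface_feature_controls(activations, labels, rel_position, step_length):
    r_pos, _ = pearsonr(rel_position, labels)
    r_len, _ = pearsonr(step_length, labels)
    make_probe = lambda: LogisticRegression(max_iter=1000)
    f1_position = cross_val_score(make_probe(), rel_position.reshape(-1, 1),
                                  labels, cv=5, scoring="f1").mean()
    f1_activations = cross_val_score(make_probe(), activations,
                                     labels, cv=5, scoring="f1").mean()
    return {"r_position": r_pos, "r_length": r_len,
            "f1_position_only": f1_position, "f1_activations": f1_activations}
```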
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: step importance can be operationalized via removability or a similar proxy suitable for supervised probe training.