Recognition: unknown
Reasoning Models Know What's Important, and Encode It in Their Activations
Pith reviewed 2026-05-10 05:22 UTC · model grok-4.3
The pith
Language models encode an internal representation of reasoning step importance in their activations prior to generating subsequent steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models encode an internal representation of step importance in their activations even prior to the generation of subsequent steps. This representation generalizes across models, is distributed across layers, and does not correlate with surface-level features such as a step's relative position or length. Probes on activations therefore identify important steps more reliably than analysis of the tokens themselves.
What carries the argument
Linear probes trained on model activations to predict step importance, where importance is defined by whether removing the step changes the final answer.
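The review does not reproduce the paper's exact configuration, but the shape of the method is clear. Below is a minimal sketch of such a probe, assuming per-step activation vectors (e.g. residual-stream states averaged over a step's tokens at one layer) and binary removability labels are already available; the names and hyperparameters are illustrative, not the paper's code.

```python
# Minimal sketch of an importance probe. Assumes each reasoning step has been
# reduced to a single activation vector and paired with a binary removability
# label (1 = deleting the step changed the final answer). Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def train_importance_probe(step_activations: np.ndarray,
                           removability_labels: np.ndarray,
                           seed: int = 0):
    """step_activations: (n_steps, d_model); removability_labels: (n_steps,)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        step_activations, removability_labels, test_size=0.2, random_state=seed)
    probe = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    probe.fit(X_tr, y_tr)
    return probe, f1_score(y_te, probe.predict(X_te))
```

The simulated rebuttal below describes the paper's own choice in similar terms (L2-regularized logistic probes with cross-validated splits), so this should be read as the generic pattern rather than the exact setup.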
If this is right
- Activations contain more information about step importance than the generated tokens.
- The importance signal is available before subsequent steps are produced.
- The representation is shared across different models rather than idiosyncratic to one architecture.
- Importance information is spread across multiple layers instead of localized.
- The encoding is independent of surface statistics such as position or length.
Where Pith is reading between the lines
- Intervening directly on the activations that encode importance could alter or shorten reasoning chains without changing the prompt.
- Activation-based importance detection might be combined with existing pruning techniques to reduce compute on long chains; a minimal sketch of how probe scores could drive such pruning follows this list.
- The finding suggests that complete interpretability of reasoning requires looking inside the model rather than stopping at output inspection.
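To make the pruning idea concrete, a probe score could gate which steps are retained before generation continues. This is speculation rendered as code, not an experiment from the paper; the threshold and all names are assumptions.

```python
# Illustrative only: keep reasoning steps whose probe score clears a threshold.
# `probe` is a trained importance probe (see the sketch above); the 0.2 cutoff
# is an arbitrary assumption, not a value reported in the paper.
def prune_chain(steps, step_activations, probe, threshold=0.2):
    keep_prob = probe.predict_proba(step_activations)[:, 1]  # P(step matters)
    return [step for step, p in zip(steps, keep_prob) if p >= threshold]
```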
Load-bearing premise
The operational definition of importance via removability accurately reflects the causal role of the step inside the model's reasoning rather than some correlated artifact.
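That operational definition reduces to a delete-and-rerun loop. A hedged sketch of the labeling procedure, where `generate_answer` and `answers_match` are hypothetical stand-ins for whatever inference and answer-comparison pipeline is actually used:

```python
# Hedged sketch of removability labeling: delete one step, re-run the model on
# the shortened chain, and record whether the final answer changes.
# `generate_answer` and `answers_match` are hypothetical helpers.
def label_step_importance(prompt, steps, generate_answer, answers_match):
    baseline_answer = generate_answer(prompt, steps)
    labels = []
    for i in range(len(steps)):
        ablated_chain = steps[:i] + steps[i + 1:]            # drop step i
        ablated_answer = generate_answer(prompt, ablated_chain)
        # A step counts as important (non-removable) if its removal flips the answer.
        labels.append(not answers_match(baseline_answer, ablated_answer))
    return labels
```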
What would settle it
If probes trained on activations do not outperform token-based classifiers at predicting which steps, when removed, change the model's final answer, the core claim fails.
read the original abstract
Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. This internal representation of importance generalizes across models, is distributed across layers, and does not correlate with surface-level features, such as a step's relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that language models encode an internal representation of reasoning-step importance in their activations. Importance is operationalized via post-hoc removability (whether deleting a step changes the final answer). Probes trained on activations predict this label more accurately than token-based baselines; the signal appears in activations prior to generation of later steps, generalizes across models, is distributed over layers, and is uncorrelated with surface features such as relative position or step length.
Significance. If substantiated, the result would indicate that activation-based probing can recover planning-like information during chain-of-thought generation that is invisible from surface tokens alone. The pre-generation timing, cross-model generalization, and reported independence from length/position are potentially valuable contributions to mechanistic interpretability of reasoning, provided the correlational evidence is strengthened with causal tests.
major comments (3)
- [Abstract and §3 (Methods)] The abstract and methods provide no details on probe architecture, training data, regularization, or controls for confounds. Without these specifics it is impossible to determine whether the reported superiority of activations over tokens reflects genuine encoding or experimental artifacts.
- [§4 (Results)] The central claim that activations contain an 'internal representation of step importance' rests on probes predicting post-hoc removability labels. This is correlational evidence only; the manuscript contains no activation interventions, patching, or causal ablation experiments that would show the model actually consults this representation when deciding what to generate next.
- [§4.2 (Controls and Timing)] The assertion that the signal 'does not correlate with surface-level features' and is present 'prior to the generation of subsequent steps' is consistent with both causal encoding and spurious correlation. Explicit correlation coefficients with position/length and a direct comparison against a position-only baseline are required to support the stronger interpretation.
minor comments (2)
- [Throughout] All result tables and figures should report error bars or confidence intervals and the exact statistical test used for each comparison.
- [§2 (Related Work)] The related-work section should more explicitly situate the probing approach relative to prior studies on CoT faithfulness and internal representations of reasoning.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for clarification and strengthening. We have revised the manuscript to address each point, adding methodological details, explicit controls, and a discussion of the correlational nature of the evidence. Our responses to the major comments are provided below.
read point-by-point responses
Referee: [Abstract and §3 (Methods)] The abstract and methods provide no details on probe architecture, training data, regularization, or controls for confounds. Without these specifics it is impossible to determine whether the reported superiority of activations over tokens reflects genuine encoding or experimental artifacts.
Authors: We agree that the original manuscript lacked sufficient methodological detail. In the revised version, Section 3 has been expanded to specify the probe architecture (linear logistic regression classifiers with L2 regularization), training data construction and splits (80/20 train/test per model with 5-fold cross-validation), hyperparameter tuning via grid search, and additional confound controls including matching on token frequency and step embedding similarity. The abstract has been updated with a concise description of the probing approach. These revisions enable better evaluation of whether the activation advantage reflects genuine internal representations. revision: yes
Referee: [§4 (Results)] The central claim that activations contain an 'internal representation of step importance' rests on probes predicting post-hoc removability labels. This is correlational evidence only; the manuscript contains no activation interventions, patching, or causal ablation experiments that would show the model actually consults this representation when deciding what to generate next.
Authors: We acknowledge that the evidence is correlational: the probes show that step importance is linearly decodable from pre-generation activations and that they outperform token-based baselines. This timing and the distributed nature across layers provide suggestive support for an internal representation, but do not demonstrate causal use by the model. We have added a new paragraph in the Discussion section explicitly noting the correlational limitation and outlining future causal experiments such as activation patching or steering to test whether the model relies on these signals during generation (an illustrative sketch of such a patching test follows these responses). revision: yes
Referee: [§4.2 (Controls and Timing)] The assertion that the signal 'does not correlate with surface-level features' and is present 'prior to the generation of subsequent steps' is consistent with both causal encoding and spurious correlation. Explicit correlation coefficients with position/length and a direct comparison against a position-only baseline are required to support the stronger interpretation.
Authors: We have addressed this by adding explicit Pearson correlation coefficients in revised §4.2, showing near-zero correlations between importance labels and both relative position (r = 0.04) and step length (r = 0.07). We also trained and evaluated a position-only baseline probe (using sinusoidal positional encodings as input) and report that activation-based probes achieve substantially higher accuracy (average +18% F1 across models). These results, now in Table 2 and Figure 3, support that the signal is not explained by surface features alone. revision: yes
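For readers wondering what the deferred causal test could look like, one generic pattern is to overwrite a step's hidden states at a single layer during a forward pass and check whether the answer changes. The sketch below assumes a LLaMA-style HuggingFace module layout; the module path, layer index, and donor activations are all assumptions, and none of this comes from the paper.

```python
# Rough sketch of an activation-patching check, not the paper's method.
# Assumes a LLaMA-style HuggingFace causal LM whose decoder blocks live at
# model.model.layers[i]; donor_acts holds replacement hidden states for the
# token positions covered by the target reasoning step.
import torch

def run_with_patched_step(model, inputs, layer_idx, step_slice, donor_acts):
    layer = model.model.layers[layer_idx]  # assumed module path

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, step_slice, :] = donor_acts  # overwrite the step's activations
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(**inputs)  # compare against an unpatched run
    finally:
        handle.remove()
```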
Circularity Check
No significant circularity; empirical probe study with independent external labels
full rationale
The paper defines importance via an external, post-hoc removability procedure (delete step, re-run, check answer change) that does not reference activations or probe outputs. Probes are trained to predict these independent labels from activations, with reported controls showing the signal is not reducible to position, length, or token baselines. No equations or claims reduce the reported internal representation to a fitted parameter or self-citation by construction. The timing argument (activations before later steps) and generalization results are additional empirical observations, not definitional. This is a standard supervised probing setup against an external benchmark, warranting score 0.
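The controls the rationale leans on reduce to two simple computations: correlating the labels with surface features, and training a probe that sees only those features. A minimal sketch under those assumptions; variable names and the F1 metric are illustrative.

```python
# Hedged sketch of the surface-feature controls described in the rebuttal:
# correlate labels with relative position and step length, then compare an
# activation probe against a position-only baseline. Illustrative names only.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def surface_feature_controls(activations, labels, rel_position, step_length):
    r_pos, _ = pearsonr(rel_position, labels)
    r_len, _ = pearsonr(step_length, labels)
    make_probe = lambda: LogisticRegression(max_iter=1000)
    f1_position = cross_val_score(make_probe(), rel_position.reshape(-1, 1),
                                  labels, cv=5, scoring="f1").mean()
    f1_activations = cross_val_score(make_probe(), activations,
                                     labels, cv=5, scoring="f1").mean()
    return {"r_position": r_pos, "r_length": r_len,
            "f1_position_only": f1_position, "f1_activations": f1_activations}
```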
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: step importance can be operationalized via removability or a similar proxy suitable for supervised probe training.