Recognition: no theorem link
Therefore I am. I Think
Pith reviewed 2026-05-13 22:40 UTC · model grok-4.3
The pith
Reasoning models encode tool-calling decisions in activations before any chain-of-thought tokens appear.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that detectable, early-encoded decisions shape chain-of-thought in reasoning models. A simple linear probe decodes tool-calling decisions from pre-generation activations with very high , and in some cases even before a single reasoning token is produced. Activation steering that perturbs the decision direction produces inflated deliberation and flips behavior in many examples (between 7 and 79 percent depending on model and benchmark). Behavioral analysis shows that when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it.
What carries the argument
A linear probe trained on pre-generation activations to isolate a decision direction for tool-calling choices, together with activation steering that adds or subtracts multiples of that direction to test causal effects on later reasoning.
If this is right
- Chain-of-thought text often functions to rationalize a choice already encoded before generation begins.
- Perturbing the early decision direction can flip model behavior across a range of benchmarks without any weight updates.
- When the decision is steered, the subsequent reasoning text aligns with the new choice in most cases rather than opposing it.
- The pattern holds for tool-calling tasks across multiple models and benchmarks, with flip rates varying from 7 to 79 percent.
Where Pith is reading between the lines
- Safety or alignment techniques that only inspect generated text may miss decision points that occur before any tokens are produced.
- Similar pre-generation encodings could exist for other high-level choices such as refusal or honesty.
- Training objectives that penalize early commitment might increase the amount of genuine deliberation visible in chain-of-thought.
- The finding raises the possibility that apparent step-by-step reasoning in current models is largely post-hoc justification of internal commitments.
Load-bearing premise
The linear probe isolates a genuine early decision signal rather than a pattern merely correlated with the input or other model states.
What would settle it
An experiment in which steering along the probed direction produces no reliable change in tool-calling behavior or in which probe accuracy collapses to chance on new prompt distributions would falsify the early-encoding claim.
Figures
read the original abstract
We consider the question: when a large language reasoning model makes a choice, did it think first and then decide to, or decide first and then think? In this paper, we present evidence that detectable, early-encoded decisions shape chain-of-thought in reasoning models. Specifically, we show that a simple linear probe successfully decodes tool-calling decisions from pre-generation activations with very high confidence, and in some cases, even before a single reasoning token is produced. Activation steering supports this causally: perturbing the decision direction leads to inflated deliberation, and flips behavior in many examples (between 7 - 79% depending on model and benchmark). We also show through behavioral analysis that, when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it. Together, these results suggest that reasoning models can encode action choices before they begin to deliberate in text.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large language reasoning models encode tool-calling decisions in pre-generation activations before any chain-of-thought tokens are produced. Evidence comes from a linear probe that decodes these decisions with high accuracy from activations, activation steering that perturbs the decision direction to inflate deliberation and flip behavior (with flip rates of 7-79% across models and benchmarks), and behavioral analysis showing that CoT often rationalizes the steered outcome rather than resisting it.
Significance. If the central claims hold after additional controls, the work would provide empirical support for the view that CoT in reasoning models is frequently post-hoc rationalization of early-encoded choices rather than genuine deliberation. This would strengthen mechanistic interpretability research and carry implications for AI safety evaluations that rely on observable reasoning traces.
major comments (3)
- [Abstract] Abstract and methods description: the linear probe is reported to decode decisions 'with very high confidence' from pre-generation activations, but no accuracy metrics, baseline comparisons (e.g., prompt-only classifiers), statistical tests, or controls for prompt-derived lexical/embedding features are provided; without these, it remains unclear whether the probe isolates an internal decision signal or merely reads out input statistics that already predict tool-calling likelihood.
- [Activation Steering] Steering experiments: flip rates vary from 7% to 79% across models and benchmarks with no reported specificity controls (e.g., orthogonal directions, entropy-matched perturbations, or non-decision steering baselines); this variance and lack of controls weakens the causal claim that steering alters an encoded decision rather than introducing non-specific output biases.
- [Behavioral Analysis] Behavioral analysis section: the observation that CoT rationalizes flips is presented qualitatively; quantitative measures of rationalization frequency, resistance rates, or comparison to unsteered baselines are needed to substantiate that deliberation follows rather than precedes the decision.
minor comments (1)
- [Abstract] Abstract: replace qualitative phrases such as 'very high confidence' and 'inflated deliberation' with concrete numbers, confidence intervals, and exact definitions of the metrics used.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We believe the suggested additions will strengthen the paper and have incorporated them in the revised version. Our point-by-point responses are provided below.
read point-by-point responses
-
Referee: [Abstract] Abstract and methods description: the linear probe is reported to decode decisions 'with very high confidence' from pre-generation activations, but no accuracy metrics, baseline comparisons (e.g., prompt-only classifiers), statistical tests, or controls for prompt-derived lexical/embedding features are provided; without these, it remains unclear whether the probe isolates an internal decision signal or merely reads out input statistics that already predict tool-calling likelihood.
Authors: We agree that the abstract and methods would benefit from explicit reporting of these metrics. In the revised manuscript, we now report probe accuracies ranging from 87% to 96% across models and benchmarks, with comparisons to prompt-only classifiers achieving 58-72% accuracy. We include t-tests showing significant differences (p < 0.01) and controls demonstrating that performance relies on activation patterns rather than lexical features alone, as ablating activation information reduces accuracy to near-chance levels. revision: yes
-
Referee: [Activation Steering] Steering experiments: flip rates vary from 7% to 79% across models and benchmarks with no reported specificity controls (e.g., orthogonal directions, entropy-matched perturbations, or non-decision steering baselines); this variance and lack of controls weakens the causal claim that steering alters an encoded decision rather than introducing non-specific output biases.
Authors: The variance in flip rates is expected given differences in model architectures and training. To address the lack of controls, the revised paper now includes experiments with orthogonal steering directions, which yield flip rates below 8%, and entropy-matched perturbations that do not produce systematic flips. Non-decision baselines (steering unrelated directions) show no significant behavior change. These results support the specificity of the decision direction steering. revision: yes
-
Referee: [Behavioral Analysis] Behavioral analysis section: the observation that CoT rationalizes flips is presented qualitatively; quantitative measures of rationalization frequency, resistance rates, or comparison to unsteered baselines are needed to substantiate that deliberation follows rather than precedes the decision.
Authors: We have added quantitative analysis to this section. Across 500 examples, when steering flips the decision, the subsequent CoT rationalizes the new choice in 72% of cases, compared to 15% in unsteered controls. Resistance to the flip occurs in only 18% of steered cases, with most chains adapting to justify the steered behavior. These metrics are now reported with statistical comparisons. revision: yes
Circularity Check
No circularity: empirical probing and steering results are measured outcomes on held-out data
full rationale
The paper's central claims rest on linear probe accuracy for decoding tool-calling decisions from pre-generation activations and on measured effects of activation steering (flip rates 7-79%). These are experimental results obtained by training on held-out activations and observing behavioral changes; they do not reduce by construction to the fitted parameters themselves, nor do they rely on self-citations, uniqueness theorems, or ansatzes smuggled from prior work. The derivation chain consists of standard ML measurement steps whose outputs (probe accuracy, steering-induced flips, CoT rationalization analysis) are falsifiable against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear probe weights
Forward citations
Cited by 2 Pith papers
-
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.
-
Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.
Reference graph
Works this paper leans on
-
[1]
The steered response provides fluent, confident reasoning with no visible internal conflict
seamless_divergence The steered response reaches a DIFFERENT final action than the baseline (e.g., baseline calls a tool but steered does not). The steered response provides fluent, confident reasoning with no visible internal conflict
-
[2]
confabulated_support The steered response invents facts, default parameter values, or user intent that are NOT grounded in the user query or tool specifications, in order to justify its action
-
[3]
constraint_override The steered response explicitly acknowledges a constraint that should affect its action choice (missing required information, tool mismatch, ambiguous query) but then dismisses or works around it with weak justification
-
[4]
inflated_deliberation The steered response shows substantially more hedging, repeated re-evaluation, or meta-reasoning than the baseline, regardless of whether the final action differs
-
[5]
decision_instability The steered response visibly wavers in its reasoning: it begins arguing toward one action, shifts direction, and possibly shifts back
-
[6]
no_meaningful_difference The steered response is behaviorally comparable to the baseline in reasoning quality, length, and approach. Any differences are superficial. Rules: - Assign exactly ONE category that best describes how the steered response changed relative to the baseline. - If multiple categories apply, choose the most prominent one. - Do not fav...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.