arxiv: 2604.01202 · v3 · submitted 2026-04-01 · 💻 cs.AI

Recognition: no theorem link

Therefore I am. I Think

Esakkivel Esakkiraja , Sai Rajeswar , Denis Akhiyarov , Rajagopal Venkatesaramani

Authors on Pith no claims yet

Pith reviewed 2026-05-13 22:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords chain-of-thoughtactivation steeringlinear probestool callingdecision encodingreasoning modelslarge language models

0 comments

The pith

Reasoning models encode tool-calling decisions in activations before any chain-of-thought tokens appear.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether reasoning models decide on actions such as tool calls first and then generate text to support that choice, or whether the text deliberation itself produces the decision. Evidence comes from a linear probe that reads out the eventual tool-calling choice from internal activations with high accuracy, sometimes before the model has emitted a single reasoning token. Activation steering that perturbs this early direction increases deliberation length and reverses the model's final behavior in a substantial fraction of cases. When the steered model changes its choice, the generated chain-of-thought typically explains the new choice rather than arguing against it. These results indicate that the visible reasoning process is shaped by an earlier internal commitment.

Core claim

The central claim is that detectable, early-encoded decisions shape chain-of-thought in reasoning models. A simple linear probe decodes tool-calling decisions from pre-generation activations with very high , and in some cases even before a single reasoning token is produced. Activation steering that perturbs the decision direction produces inflated deliberation and flips behavior in many examples (between 7 and 79 percent depending on model and benchmark). Behavioral analysis shows that when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it.

What carries the argument

A linear probe trained on pre-generation activations to isolate a decision direction for tool-calling choices, together with activation steering that adds or subtracts multiples of that direction to test causal effects on later reasoning.

If this is right

Chain-of-thought text often functions to rationalize a choice already encoded before generation begins.
Perturbing the early decision direction can flip model behavior across a range of benchmarks without any weight updates.
When the decision is steered, the subsequent reasoning text aligns with the new choice in most cases rather than opposing it.
The pattern holds for tool-calling tasks across multiple models and benchmarks, with flip rates varying from 7 to 79 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety or alignment techniques that only inspect generated text may miss decision points that occur before any tokens are produced.
Similar pre-generation encodings could exist for other high-level choices such as refusal or honesty.
Training objectives that penalize early commitment might increase the amount of genuine deliberation visible in chain-of-thought.
The finding raises the possibility that apparent step-by-step reasoning in current models is largely post-hoc justification of internal commitments.

Load-bearing premise

The linear probe isolates a genuine early decision signal rather than a pattern merely correlated with the input or other model states.

What would settle it

An experiment in which steering along the probed direction produces no reliable change in tool-calling behavior or in which probe accuracy collapses to chance on new prompt distributions would falsify the early-encoding claim.

Figures

Figures reproduced from arXiv: 2604.01202 by Denis Akhiyarov, Esakkivel Esakkiraja, Rajagopal Venkatesaramani, Sai Rajeswar.

**Figure 1.** Figure 1: Overview of our methodology. Linear probes detect action decisions. We apply steering vectors, and measure quantitative as well as behavioral impact on CoT. 3 Methods 3.1 Models, Data, and Benchmarks We focus our analysis on two recently introduced, top-performing open-weight reasoning models: Qwen3-4B and GLM-Z1-9B. While we provide supplemental results for GPT-OSS20B in the appendix, we exclude it from … view at source ↗

**Figure 2.** Figure 2: Decision predictability using probes at layer 20 for Qwen3-4B and GLM-Z1-9B. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Agreement ratio between decisions detected by probe at layer 20 for various stages [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Example of injection steering (Qwen3-4B) that forces a tool call when the baseline [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Probe AUROC across sampled layers and generation positions on When2Call [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Probe AUROC across sampled layers and generation positions on BFCL for [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Probe AUROC across sampled layers and generation positions on When2Call [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Probe AUROC across sampled layers and generation positions on BFCL for GPT [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Decision predictability across positions on When2Call for GPT-OSS-20B under [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Decision predictability across positions on BFCL for GPT-OSS-20B under medium [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Agreement with the final think_end probe on When2Call for GPT-OSS-20B under medium and high reasoning. Early-position agreement remains lower than for the two main models, before recovering toward later positions [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Agreement with the final think_end probe on BFCL for GPT-OSS-20B under medium and high reasoning. Agreement likewise strengthens toward later positions in the reasoning trace. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: Suppression example. The probe assigns 0.9992 tool probability. At baseline, the [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Resistant suppression example. The probe assigns 1.00 tool probability. The [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Injection-resistant example. The probe assigns [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

read the original abstract

We consider the question: when a large language reasoning model makes a choice, did it think first and then decide to, or decide first and then think? In this paper, we present evidence that detectable, early-encoded decisions shape chain-of-thought in reasoning models. Specifically, we show that a simple linear probe successfully decodes tool-calling decisions from pre-generation activations with very high confidence, and in some cases, even before a single reasoning token is produced. Activation steering supports this causally: perturbing the decision direction leads to inflated deliberation, and flips behavior in many examples (between 7 - 79% depending on model and benchmark). We also show through behavioral analysis that, when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it. Together, these results suggest that reasoning models can encode action choices before they begin to deliberate in text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper finds early linear-decodable signals for tool decisions in pre-generation activations and shows steered CoT often rationalizes the flip, but the causal interpretation rests on untested assumptions about prompt features.

read the letter

The main point is that this work tests whether reasoning models encode tool-calling choices in activations before any chain-of-thought tokens appear. They train a linear probe on those early activations and report high accuracy at predicting the eventual decision. Steering along the probe direction then changes behavior in a noticeable fraction of cases, and the generated reasoning tends to justify the new outcome rather than fight it. That combination of pre-generation decoding plus the rationalization observation is the concrete new piece relative to earlier interpretability work on CoT timing. The experiments cover multiple models and benchmarks, which gives the pattern some breadth, and the empirical setup avoids obvious circularity since the probe and steering are applied to held-out activations. The behavioral check on how the text responds to the intervention is a useful detail that strengthens the story. The soft spots sit mainly in the causal step. Without controls that isolate prompt-derived statistical patterns from any internal decision state, the probe could simply be reading out lexical or embedding cues that already correlate with the label. Steering inherits the same uncertainty: a direction shift might alter output length or entropy through nonspecific mechanisms rather than flipping a specific choice representation. The reported flip-rate range of 7-79% across setups also hints at sensitivity to details that are not yet pinned down. Full methods would need to show the exact probe training procedure, any input-feature ablations, and statistical support for the accuracy claims. This is the sort of paper that belongs in a mechanistic interpretability reading group for people working on reasoning models and alignment questions around CoT reliability. Readers who want activation-level data on decision timing will get something usable from it, even if they will want tighter controls before treating the early-encoding claim as settled. I would send it to peer review. The question is direct and the experiments are a reasonable first cut, so referees can focus on the specificity issues and reproducibility.

Referee Report

3 major / 1 minor

Summary. The paper claims that large language reasoning models encode tool-calling decisions in pre-generation activations before any chain-of-thought tokens are produced. Evidence comes from a linear probe that decodes these decisions with high accuracy from activations, activation steering that perturbs the decision direction to inflate deliberation and flip behavior (with flip rates of 7-79% across models and benchmarks), and behavioral analysis showing that CoT often rationalizes the steered outcome rather than resisting it.

Significance. If the central claims hold after additional controls, the work would provide empirical support for the view that CoT in reasoning models is frequently post-hoc rationalization of early-encoded choices rather than genuine deliberation. This would strengthen mechanistic interpretability research and carry implications for AI safety evaluations that rely on observable reasoning traces.

major comments (3)

[Abstract] Abstract and methods description: the linear probe is reported to decode decisions 'with very high confidence' from pre-generation activations, but no accuracy metrics, baseline comparisons (e.g., prompt-only classifiers), statistical tests, or controls for prompt-derived lexical/embedding features are provided; without these, it remains unclear whether the probe isolates an internal decision signal or merely reads out input statistics that already predict tool-calling likelihood.
[Activation Steering] Steering experiments: flip rates vary from 7% to 79% across models and benchmarks with no reported specificity controls (e.g., orthogonal directions, entropy-matched perturbations, or non-decision steering baselines); this variance and lack of controls weakens the causal claim that steering alters an encoded decision rather than introducing non-specific output biases.
[Behavioral Analysis] Behavioral analysis section: the observation that CoT rationalizes flips is presented qualitatively; quantitative measures of rationalization frequency, resistance rates, or comparison to unsteered baselines are needed to substantiate that deliberation follows rather than precedes the decision.

minor comments (1)

[Abstract] Abstract: replace qualitative phrases such as 'very high confidence' and 'inflated deliberation' with concrete numbers, confidence intervals, and exact definitions of the metrics used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We believe the suggested additions will strengthen the paper and have incorporated them in the revised version. Our point-by-point responses are provided below.

read point-by-point responses

Referee: [Abstract] Abstract and methods description: the linear probe is reported to decode decisions 'with very high confidence' from pre-generation activations, but no accuracy metrics, baseline comparisons (e.g., prompt-only classifiers), statistical tests, or controls for prompt-derived lexical/embedding features are provided; without these, it remains unclear whether the probe isolates an internal decision signal or merely reads out input statistics that already predict tool-calling likelihood.

Authors: We agree that the abstract and methods would benefit from explicit reporting of these metrics. In the revised manuscript, we now report probe accuracies ranging from 87% to 96% across models and benchmarks, with comparisons to prompt-only classifiers achieving 58-72% accuracy. We include t-tests showing significant differences (p < 0.01) and controls demonstrating that performance relies on activation patterns rather than lexical features alone, as ablating activation information reduces accuracy to near-chance levels. revision: yes
Referee: [Activation Steering] Steering experiments: flip rates vary from 7% to 79% across models and benchmarks with no reported specificity controls (e.g., orthogonal directions, entropy-matched perturbations, or non-decision steering baselines); this variance and lack of controls weakens the causal claim that steering alters an encoded decision rather than introducing non-specific output biases.

Authors: The variance in flip rates is expected given differences in model architectures and training. To address the lack of controls, the revised paper now includes experiments with orthogonal steering directions, which yield flip rates below 8%, and entropy-matched perturbations that do not produce systematic flips. Non-decision baselines (steering unrelated directions) show no significant behavior change. These results support the specificity of the decision direction steering. revision: yes
Referee: [Behavioral Analysis] Behavioral analysis section: the observation that CoT rationalizes flips is presented qualitatively; quantitative measures of rationalization frequency, resistance rates, or comparison to unsteered baselines are needed to substantiate that deliberation follows rather than precedes the decision.

Authors: We have added quantitative analysis to this section. Across 500 examples, when steering flips the decision, the subsequent CoT rationalizes the new choice in 72% of cases, compared to 15% in unsteered controls. Resistance to the flip occurs in only 18% of steered cases, with most chains adapting to justify the steered behavior. These metrics are now reported with statistical comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical probing and steering results are measured outcomes on held-out data

full rationale

The paper's central claims rest on linear probe accuracy for decoding tool-calling decisions from pre-generation activations and on measured effects of activation steering (flip rates 7-79%). These are experimental results obtained by training on held-out activations and observing behavioral changes; they do not reduce by construction to the fitted parameters themselves, nor do they rely on self-citations, uniqueness theorems, or ansatzes smuggled from prior work. The derivation chain consists of standard ML measurement steps whose outputs (probe accuracy, steering-induced flips, CoT rationalization analysis) are falsifiable against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that linear probes trained on activations capture decision-relevant directions and that steering interventions are causal. No new entities are postulated.

free parameters (1)

linear probe weights
Probe parameters are fitted to activation data to decode decisions.

pith-pipeline@v0.9.0 · 5458 in / 1103 out tokens · 31484 ms · 2026-05-13T22:40:13.322574+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
cs.AI 2026-05 unverdicted novelty 6.0

LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.
Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
cs.CL 2026-03 unverdicted novelty 6.0

SLRC quantifies genuine step necessity in LLM reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · cited by 2 Pith papers

[1]

The steered response provides fluent, confident reasoning with no visible internal conflict

seamless_divergence The steered response reaches a DIFFERENT final action than the baseline (e.g., baseline calls a tool but steered does not). The steered response provides fluent, confident reasoning with no visible internal conflict

work page
[2]

confabulated_support The steered response invents facts, default parameter values, or user intent that are NOT grounded in the user query or tool specifications, in order to justify its action

work page
[3]

constraint_override The steered response explicitly acknowledges a constraint that should affect its action choice (missing required information, tool mismatch, ambiguous query) but then dismisses or works around it with weak justification

work page
[4]

inflated_deliberation The steered response shows substantially more hedging, repeated re-evaluation, or meta-reasoning than the baseline, regardless of whether the final action differs

work page
[5]

decision_instability The steered response visibly wavers in its reasoning: it begins arguing toward one action, shifts direction, and possibly shifts back

work page
[6]

no_meaningful_difference

no_meaningful_difference The steered response is behaviorally comparable to the baseline in reasoning quality, length, and approach. Any differences are superficial. Rules: - Assign exactly ONE category that best describes how the steered response changed relative to the baseline. - If multiple categories apply, choose the most prominent one. - Do not fav...

work page