The Shape of Wisdom: Decision Trajectories in Language Models

Shailesh Rana

arXiv: 2606.01202 · v1 · pith:MTXJQLFYnew · submitted 2026-05-31 · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-06-28 17:16 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.LG

keywords decision trajectories · language models · MMLU · attention · MLP · answer margin · stability · interpretability

The pith

Language models reach most correct answers through unstable trajectories across layers rather than stable ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how answer scores evolve through the layers of three language models on 9,000 MMLU questions. It introduces three measures for each trajectory: the current margin for the answer, how that margin changes in the next layer, and how close it is to flipping. The key finding is that unstable-correct answers form the largest group, exceeding stable-correct ones. In the stable-correct cases, attention scalars on average support the correct direction while MLP scalars do not. Span deletion tests further show that text supporting the answer boosts the margin and distractor text reduces it.

Core claim

Across 9,000 trajectories in Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3 on MMLU, correctness and stability diverge, with unstable-correct being the dominant category. In stable-correct trajectories the average attention scalar aligns with the correct answer while the average MLP scalar does not. Removing answer-supporting spans decreases the margin and removing distractor-like spans increases it. This yields a practical way to classify answers as settled, fragile, or moved by specific sources.
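
The span-deletion test described above can be sketched as a simple before/after comparison. This is a minimal illustration under stated assumptions, not the paper's protocol: `deletion_effects` and `toy_margin` are hypothetical names, and `toy_margin` is a keyword-counting stand-in for a real model call that would return the answer margin for a prompt.

```python
def deletion_effects(text, spans, margin_fn):
    """Margin change caused by deleting each span: margin(deleted) - margin(full).

    A negative value means deleting the span hurt the margin (the span acted as
    support); a positive value means deleting it helped (distractor-like).
    """
    base = margin_fn(text)
    return {span: margin_fn(text.replace(span, "")) - base for span in spans}


def toy_margin(text):
    # Hypothetical stand-in for a model call:
    # +1 per supporting cue, -1 per distractor cue.
    return text.count("boils at 100 C") - text.count("freezes at 100 C")


question = (
    "Water boils at 100 C at sea level. Some claim it freezes at 100 C. "
    "At what temperature does water boil?"
)
effects = deletion_effects(
    question, ["boils at 100 C", "freezes at 100 C"], toy_margin
)
```

In this toy setup, deleting the supporting span lowers the margin and deleting the distractor raises it, which is the direction of effect the paper reports. With a real model, `margin_fn` would wrap a forward pass, and how the spans are chosen is itself part of the protocol.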

What carries the argument

The three quantities used to describe each decision trajectory: current answer margin, next-layer change in margin, and distance from a decision flip.
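
A minimal sketch of how these three quantities could be computed from a per-layer readout of answer-option scores. The definitions are assumptions for illustration, since this page does not give the paper's equations: the margin is taken as the correct option's score minus the best competing score, and the flip distance as the absolute margin (how far the score must move to change sign). The function name `trajectory_quantities` and the toy numbers are hypothetical.

```python
import numpy as np


def trajectory_quantities(layer_logits, correct_idx):
    """Return (margin, next_layer_delta, flip_distance), one value per layer.

    layer_logits: array of shape (n_layers, n_options), one row of
    answer-option scores per layer (how they are read out is an assumption).
    """
    logits = np.asarray(layer_logits, dtype=float)
    correct = logits[:, correct_idx]
    others = np.delete(logits, correct_idx, axis=1)
    margin = correct - others.max(axis=1)
    # The last layer has no next layer, so its delta is left undefined (NaN).
    delta = np.append(np.diff(margin), np.nan)
    flip_distance = np.abs(margin)
    return margin, delta, flip_distance


# Toy 4-layer, 4-option trajectory; option 0 is treated as correct.
toy = [
    [0.0, 0.5, 0.1, 0.2],
    [1.0, 0.5, 0.3, 0.2],
    [0.5, 1.0, 0.2, 0.1],
    [2.0, 0.5, 0.4, 0.3],
]
margin, delta, flip_distance = trajectory_quantities(toy, correct_idx=0)
```

The toy trajectory ends correct but crosses zero several times on the way, which is the kind of path an endpoint-only accuracy score would hide.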

If this is right

Correct answers are frequently not stable across model depth.
In stable-correct cases, the average attention scalar supports the correct answer while the average MLP scalar does not.
Answer-supporting text spans raise the margin when present; deleting them lowers it.
Models can be correct without the decision being firmly settled early in the network.
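
To make the stable/unstable distinction concrete, here is a small sketch that labels margin trajectories and counts the resulting groups. The labeling rule is hypothetical: this page does not state the paper's exact criterion, so "stable" is taken to mean that the margin keeps its final sign over the second half of the layers. The names `label_trajectory` and `settle_from` are illustrative.

```python
from collections import Counter


def label_trajectory(margins, settle_from=0.5):
    """Label one margin trajectory as stable/unstable and correct/wrong.

    Hypothetical rule: 'correct' means the final margin is positive;
    'stable' means the margin keeps its final sign over the last
    (1 - settle_from) fraction of layers.
    """
    final_positive = margins[-1] > 0
    start = int(len(margins) * settle_from)
    tail = margins[start:]
    stable = all((m > 0) == final_positive for m in tail)
    stability = "stable" if stable else "unstable"
    outcome = "correct" if final_positive else "wrong"
    return f"{stability}-{outcome}"


toy_trajectories = [
    [-0.2, 0.4, 0.9, 1.3],     # settles early, ends correct
    [0.3, 0.1, -0.4, 0.6],     # flips late, ends correct
    [0.5, -0.2, 0.3, -0.7],    # flips late, ends wrong
    [-0.3, -0.1, -0.5, -0.9],  # never positive
]
counts = Counter(label_trajectory(t) for t in toy_trajectories)
```

Changing `settle_from` or the sign criterion moves trajectories between groups, so any reported group sizes depend on the rule chosen.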

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These trajectories might identify cases where the model is relying on shallow patterns rather than deep reasoning.
Similar tracking could apply to other benchmarks to compare model robustness.
Targeting attention adjustments might increase the proportion of stable-correct trajectories.
Layer-wise margin tracking offers a lightweight alternative to full mechanistic interpretability for decision analysis.

Load-bearing premise

The current answer margin, its next-layer change, and the distance to a decision flip together capture the essential dynamics of the model's answer selection process.
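
One way to write the premise down, using notation that is an assumption here rather than the paper's own. The symbols $z_\ell(a)$ (score of option $a$ read out at layer $\ell$) and $a^*$ (the correct option) are introduced only for illustration, and taking the flip distance to be the absolute margin is one plausible reading among several.

```latex
% Hypothetical notation; the paper's exact definitions may differ.
\begin{align}
  m_\ell &= z_\ell(a^*) - \max_{a \neq a^*} z_\ell(a)
    && \text{(current answer margin)} \\
  \Delta_\ell &= m_{\ell+1} - m_\ell
    && \text{(next-layer change in margin)} \\
  d_\ell &= \lvert m_\ell \rvert
    && \text{(distance from a decision flip)}
\end{align}
```

Under this reading, a decision flip occurs between layers $\ell$ and $\ell+1$ whenever $m_\ell$ and $m_{\ell+1}$ have opposite signs, which requires $\lvert \Delta_\ell \rvert > d_\ell$.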

What would settle it

A replication study on the same or similar models and dataset that finds the stable-correct group to be the largest or shows no consistent directional difference between attention and MLP scalars in stable cases.

Figures

Figures reproduced from arXiv: 2606.01202 by Shailesh Rana.

**Figure 2.** Trajectory regimes expose heterogeneity behind endpoint accuracy.

**Figure 3.** Attention and MLP scalars give useful one-step accounting of margin drift.

**Figure 4.** Span deletion separates operational evidence from distractors.

**Figure 5.** Counterfactual accounting is useful but protocol-sensitive.

Original abstract

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
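
The attention and MLP scalars mentioned in the abstract can be illustrated with a one-step accounting sketch on random vectors. Everything here is an assumption for illustration: the margin is read out linearly along a fixed direction (correct-option unembedding row minus one competitor's row), the final normalization a real model applies is ignored, and `W_U`, `attn_out`, and `mlp_out` are hypothetical stand-ins, not values from any of the three models.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_options = 16, 4

# Hypothetical stand-ins: an unembedding restricted to the answer options,
# one layer's residual-stream state, and that layer's attention/MLP outputs.
W_U = rng.normal(size=(n_options, d_model))
h = rng.normal(size=d_model)
attn_out = rng.normal(size=d_model)
mlp_out = rng.normal(size=d_model)

correct, competitor = 0, 1
direction = W_U[correct] - W_U[competitor]  # margin readout direction


def margin(state):
    # Linear readout only; a real model applies a final normalization first.
    return float(state @ direction)


# Each component's scalar is its projection onto the margin direction.
attn_scalar = float(attn_out @ direction)
mlp_scalar = float(mlp_out @ direction)

# Residual update: the next state adds both component outputs.
h_next = h + attn_out + mlp_out
drift = margin(h_next) - margin(h)
```

Because the readout is linear in this sketch, the next-layer margin change equals `attn_scalar + mlp_scalar` exactly. With normalization in the loop the decomposition is only approximate, which may be one reason such accounting is sensitive to protocol.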

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tracks margin trajectories on MMLU to separate stable from unstable correct answers, but the partition and follow-up claims rest on details not visible in the abstract.

The paper measures how the logit margin for the selected answer evolves layer by layer in three 7-8B instruct models on MMLU. It summarizes each of the 9,000 trajectories with current margin, next-layer delta, and distance to a flip, then reports that unstable-correct cases form the largest group. A smaller traced set then checks whether average attention or MLP scalars align with the correct direction and runs span deletions.

The method itself is straightforward and could be reproduced from the description. Reporting the split between stable-correct and unstable-correct directly addresses a practical question about when an answer is settled before the final layer. The attention-versus-MLP comparison and the deletion test are concrete attempts to link the scalars to sources inside the model.

The main limitation is that the abstract supplies no per-category counts, variance estimates, exact computation rules for the three quantities, or checks against non-monotonic paths. This matters for a simple stress test: if the margin is computed only on the final token and flip distance is a simple cumulative threshold, the stable/unstable label could shift with head-level variation or intermediate-token effects, and the attention-MLP result would inherit that uncertainty. Without the full methods and tables, it is not possible to judge how robust the partition actually is.

This is for people who need a lightweight way to flag fragile answers in deployed models. It is worth sending to referees so the methods and numbers can be examined, but the current write-up does not yet let a reader verify the headline split.

Referee Report

2 major / 2 minor

Summary. The paper reports an empirical analysis of 9,000 decision trajectories on MMLU across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3. Each trajectory is summarized by three scalars: current answer margin (final-token logit difference), next-layer change in that margin, and distance to a decision flip. The central finding is that correctness and stability are distinct, with the largest group being unstable-correct rather than stable-correct. In a traced subset of stable-correct cases, average attention scalars align with the correct direction while average MLP scalars do not; span-deletion experiments further show that removing answer-supporting text decreases the margin and removing distractor text increases it.

Significance. If the three-quantity partition is robust, the work supplies a reproducible empirical method for identifying settled versus fragile answers inside language models and for isolating the directional contributions of attention versus MLP layers. This is a modest but concrete step toward trajectory-level interpretability; the absence of a full circuit explanation is explicitly acknowledged.

major comments (2)

[Trajectory characterization and results sections] The claim that the largest group is unstable-correct rests entirely on partitioning trajectories with the three scalars (current margin, next-layer delta, distance to flip). The manuscript provides no validation that these scalars are sufficient to capture key dynamics; non-monotonic margin trajectories, per-head attention variation, or intermediate-token contributions could produce different classifications. This directly affects both the group-size result and the subsequent attention-vs-MLP comparison.
[Traced-subset analysis] The attention/MLP directional finding and the span-deletion results are reported only on the subset already classified by the same three-quantity partition. Any misclassification therefore propagates to the mechanistic claims.
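
The concern about non-monotonic paths can be made checkable with a small per-trajectory summary. This is a sketch under assumptions, not a procedure from the manuscript: `nonmonotonic_summary` is a hypothetical helper that counts direction reversals (the margin switches between rising and falling) and sign flips (the margin crosses zero) in a list of per-layer margins.

```python
import numpy as np


def nonmonotonic_summary(margins):
    """Count direction reversals and sign flips in one margin trajectory."""
    m = np.asarray(margins, dtype=float)
    steps = np.diff(m)
    steps = steps[steps != 0]  # ignore flat steps
    reversals = int(np.sum(np.sign(steps[1:]) != np.sign(steps[:-1])))
    signs = np.sign(m[m != 0])  # ignore exact zeros
    sign_flips = int(np.sum(signs[1:] != signs[:-1]))
    return {
        "reversals": reversals,
        "sign_flips": sign_flips,
        "monotonic": reversals == 0,
    }


summary = nonmonotonic_summary([-0.5, 0.5, -0.5, 1.5])
```

Reporting the share of trajectories with `reversals > 0` alongside the group sizes would show how much of the partition rests on paths that a single next-layer delta cannot describe.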

minor comments (2)

[Abstract] The abstract states the headline findings and study size but supplies no numerical values, error bars, exclusion criteria, or verification steps for the reported group sizes or scalar averages.
[Methods] Notation for the three quantities (margin, next-layer change, flip distance) is introduced without explicit equations or pseudocode, making exact reproduction difficult from the text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate where revisions will be made to strengthen the validation of our trajectory partitioning.

Referee: [Trajectory characterization and results sections] The claim that the largest group is unstable-correct rests entirely on partitioning trajectories with the three scalars (current margin, next-layer delta, distance to flip). The manuscript provides no validation that these scalars are sufficient to capture key dynamics; non-monotonic margin trajectories, per-head attention variation, or intermediate-token contributions could produce different classifications. This directly affects both the group-size result and the subsequent attention-vs-MLP comparison.

Authors: We acknowledge that the three scalars constitute a simplified partition and that the manuscript does not include explicit checks against non-monotonic margin trajectories, per-head attention variation, or intermediate-token effects. These scalars were selected because they directly quantify the evolution of the answer margin, which is the central object of the stability analysis. In revision we will add a supplementary quantification of non-monotonic trajectories across the 9,000-trajectory corpus and report their effect on the reported group sizes; this constitutes a partial revision because the core empirical picture remains anchored to the proposed scalars.
revision: partial

Referee: [Traced-subset analysis] The attention/MLP directional finding and the span-deletion results are reported only on the subset already classified by the same three-quantity partition. Any misclassification therefore propagates to the mechanistic claims.

Authors: We agree that the traced-subset analyses inherit any classification limitations of the three-scalar partition. The span-deletion experiments provide an independent probe of directional contributions, but we will revise the text to state explicitly that the attention-versus-MLP and span-deletion results are conditional on the stable-correct classification and to discuss the implications of potential misclassification. This is a partial revision.
revision: partial

Circularity Check

0 steps flagged

No circularity; empirical measurements of defined quantities

The paper defines three scalar quantities directly from model logits (current answer margin, next-layer change in margin, distance to decision flip) and uses them to partition observed trajectories into empirical categories such as unstable-correct. No derivations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatzes are present. The central claim follows from direct computation on the 9,000-trajectory dataset rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no explicit free parameters, axioms, or invented entities; the work is descriptive of observed trajectories.
