The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models

Guillaume Rabusseau; Marius Mosbach; Michael Rizvi-Martel

arxiv: 2604.06374 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-10 19:56 UTC · model grok-4.3


keywords latent chain-of-thought · superposition · language models · reasoning · internal representations · training regimes · shortcut solutions


The pith

Language models only exhibit superposition in latent chain-of-thought reasoning when trained from scratch on the task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether continuous latent reasoning lets models hold multiple candidate answers at once inside a single hidden state. It compares three setups: constructing latent thoughts from a frozen model, fine-tuning a pretrained model to produce them, and training an entirely new model from random weights using only latent thoughts. Probing with Logit Lens and entity-level checks shows superposition only in the from-scratch case; the other two regimes either collapse to one answer or bypass the mechanism with shortcuts. The difference traces to pretraining biases that push later layers toward single-token commitments and to capacity limits that favor simpler solutions.

Core claim

Only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, the superposition either collapses or is not used at all, with models discovering shortcut solutions instead, because pretraining on natural language biases models to commit to a token in the last layers and because capacity has a huge effect on which solutions a model favors.

What carries the argument

Comparison across training-free, fine-tuned, and from-scratch regimes, tracked by Logit Lens and entity-level probing of internal representations to detect whether multiple candidate solutions remain active simultaneously.
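As a rough illustration of the measurement behind this comparison, the sketch below projects a hidden state through an unembedding matrix (the basic Logit Lens step) and computes the Shannon entropy of the resulting token distribution. It is a minimal pure-Python sketch, not the authors' code: the function names, the toy matrix, and the omission of the final normalization layer that Logit Lens implementations commonly apply are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project one hidden state onto the vocabulary.

    hidden:  list of d floats (an intermediate-layer hidden state)
    unembed: list of V rows of d floats (one row per vocabulary token)
    Returns a probability distribution over the V tokens.
    Assumption: no final layer normalization is applied here.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

def shannon_entropy(probs):
    """Entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy example with hypothetical numbers: 3-token vocabulary, 2-d hidden states.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
mixed = logit_lens([2.0, 2.0], unembed)      # two tokens tied
committed = logit_lens([8.0, 0.0], unembed)  # one token dominates

print(shannon_entropy(mixed), shannon_entropy(committed))
```

A state that splits its mass between two tokens yields higher entropy than a state committed to one token, which is the kind of signal an entropy-by-layer plot tracks.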

If this is right

Superposition in latent reasoning requires full task-specific training from random initialization rather than adaptation of existing weights.
Pretraining on natural language data creates a bias toward early token commitment that prevents maintenance of multiple hypotheses.
Model capacity determines whether a network discovers and maintains complex representations like superposition or defaults to shortcuts.
Practical latent CoT systems built by fine-tuning will rarely realize the hypothesized expressivity gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Because full retraining is expensive, the conditions for superposition are unlikely to appear in most deployed systems.
Hybrid approaches that combine limited task-specific training with techniques to preserve multi-hypothesis representations might induce superposition without starting from scratch.
The same regime-dependent collapse could appear in other continuous reasoning formats that rely on adapting pretrained models.

Load-bearing premise

That Logit Lens and entity-level probing can reliably detect the presence or absence of superposition in the model's internal representations across the three regimes.
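To make the premise concrete, here is one way an entity-level check could look: restrict a token distribution to the candidate entities, renormalize, and count how many candidates keep a meaningful share. This is a hedged sketch only. The page does not give the paper's probing procedure, so `entity_belief`, `active_candidates`, the token ids, and the 0.2 threshold are all hypothetical.

```python
def entity_belief(probs, entity_ids):
    """Renormalize probability mass over candidate entity tokens.

    probs:      list of probabilities indexed by token id
    entity_ids: token ids of the candidate entities (hypothetical ids)
    Returns {entity_id: share of the total entity mass}.
    """
    mass = {e: probs[e] for e in entity_ids}
    total = sum(mass.values())
    if total == 0.0:
        return {e: 0.0 for e in entity_ids}
    return {e: m / total for e, m in mass.items()}

def active_candidates(belief, threshold=0.2):
    """Entities whose normalized share meets an assumed threshold."""
    return sorted(e for e, b in belief.items() if b >= threshold)

# Toy distributions over a 4-token vocabulary; tokens 0-2 are entities.
spread = entity_belief([0.30, 0.25, 0.05, 0.40], [0, 1, 2])
collapsed = entity_belief([0.57, 0.02, 0.01, 0.40], [0, 1, 2])

print(active_candidates(spread))     # two candidates remain active
print(active_candidates(collapsed))  # one candidate remains active
```

The caveat raised in this premise still applies: a probe like this reports how mass is distributed at readout, and by itself cannot show that the model uses several candidates simultaneously.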

What would settle it

A fine-tuned or training-free model that keeps multiple distinct token predictions active across intermediate layers without collapsing to a single commitment in the final layers would contradict the reported absence of superposition in those regimes.

Figures

Figures reproduced from arXiv: 2604.06374 by Guillaume Rabusseau, Marius Mosbach, Michael Rizvi-Martel.

**Figure 1.** Two latent CoT approaches. Left: Coconut feeds the last hidden state directly back as the next input embedding, forming a recurrent loop in continuous space. Right: Soft Thinking computes a probability distribution over the vocabulary and forms the next input as a weighted sum of token embeddings. Both methods replace discrete reasoning tokens with continuous representations, but differ in how these repres…

**Figure 2.** Superposition collapses early in the forward pass of QwQ-32B. (a) Shannon entropy shows identical patterns for Soft Thinking (orange) and discrete CoT (blue), both converging to near-zero entropy at the same rate. (b) KL divergence drops to ∼10⁻⁴ in middle layers, showing soft thinking tokens become functionally identical to discrete tokens within the first few layers. The uncertainty in soft thinking tok…

**Figure 3.** Step-aware entity belief for fine-tuned GPT-2 on 5-step ProsQA examples (normal…

**Figure 4.** Step-aware entity belief for from-scratch 2-layer models on 4-step ProsQA exam…

**Figure 5.** Entropy of synthetic uniform superpositions across layers (Qwen2-1.5B).

**Figure 6.** Step-aware entity belief across model depths (4-step ProsQA examples, from…

**Figure 7.** Top 3 tokens at the output layer for time steps with largest (top) and smallest…

**Figure 8.** Top 3 predicted tokens at the output layer for time steps with largest (top) and…

**Figure 9.** Dataset-averaged intervention metrics for…

**Figure 10.** Dataset-averaged intervention metrics for…

**Figure 11.** Dataset-averaged intervention metrics for…

**Figure 12.** Dataset-averaged intervention metrics for…

**Figure 13.** Dataset-averaged intervention metrics for…

**Figure 14.** Dataset-averaged intervention metrics for…

**Figure 15.** Dataset-averaged intervention metrics for…

**Figure 16.** QwQ-32B on AIME2024 (N=30). Entropy and KL divergence by layer.

**Figure 17.** Qwen2-1.5B on MATH500 (N=500). Entropy and KL divergence by layer.

**Figure 18.** Qwen2-1.5B on GSM8K (N=500). Entropy and KL divergence by layer.

**Figure 19.** Qwen2-1.5B on AIME2024 (N=30). Entropy and KL divergence by layer.

**Figure 20.** DeepSeek-R1-Distill-Llama-70B on MATH500 (N=500). Entropy and KL divergence by layer. Logit lens is applied at 5 evenly spaced layers {0, 20, 40, 60, 79} of the 80-layer model.

**Figure 21.** DeepSeek-R1-Distill-Llama-70B on AIME2024 (N=30). Entropy and KL divergence by layer. Same 5-layer probe as…

**Figure 22.** Logit Lens entropy at reasoning positions for fine-tuned GPT-2 (left, 12 layers)…

**Figure 23.** Normalized entity probability mass at each reasoning step for 4-step ProsQA…

**Figure 24.** Step-aware entity belief across model depths for 3-step ProsQA examples (from…

**Figure 25.** Step-aware entity belief for the 3.4% of ProsQA test examples (17/500) that…

**Figure 26.** Normalized entity probability mass at each reasoning step on ProntoQA (fine…

**Figure 27.** Normalized probability of the correct label (“True” or “False”) across latent…

**Figure 28.** Mean L2 gradient norms per parameter group during Coconut fine-tuning on ProsQA (GPT-2). Each panel shows one parameter group: word token embeddings (wte + lm head), attention, MLP, LayerNorm, and positional embeddings (wpe). Alternating shaded bands denote the five-epoch training stages of the Coconut curriculum. Note that gradient magnitudes remain non-trivial across all groups and stages, indicating th…

**Figure 29.** Logit Lens entropy at reasoning positions for fine-tuned GPT-2 (left) and a from…
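Several captions above report KL divergence by layer between Soft Thinking and discrete CoT. The sketch below shows one plausible form of that computation on per-layer token distributions. It is an illustration under assumptions, not the paper's code: the direction KL(soft ‖ discrete) is inferred from the panel titles, and `kl_by_layer`, the `eps` guard, and the toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero in q."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

def kl_by_layer(soft_layers, discrete_layers):
    """One KL value per layer, comparing soft-input and discrete-input runs."""
    return [kl_divergence(s, d) for s, d in zip(soft_layers, discrete_layers)]

# Toy per-layer distributions over a 2-token vocabulary (hypothetical values).
soft_layers = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
discrete_layers = [[0.9, 0.1], [0.9, 0.1], [0.99, 0.01]]

kls = kl_by_layer(soft_layers, discrete_layers)
print(kls)
```

In this toy case the two runs differ only at the first layer, so the divergence is positive there and zero afterwards, mirroring the qualitative pattern that the Figure 2 caption describes.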

Original abstract

Latent reasoning via continuous chain-of-thoughts (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: i) pretraining on natural language data biases models to commit to a token in the last layers ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.
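The training-free regime in the abstract builds latent thoughts as convex combinations of token embeddings. A minimal sketch of that construction follows, assuming a plain list-of-lists embedding table; `soft_input` and the toy embeddings are hypothetical names and values, not taken from the paper.

```python
def soft_input(probs, embeddings):
    """Convex combination of token embeddings: sum_i probs[i] * embeddings[i].

    probs:      probability distribution over the vocabulary
    embeddings: list of V embedding vectors, each a list of d floats
    """
    if any(p < 0.0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must be a probability distribution")
    dim = len(embeddings[0])
    return [
        sum(p * emb[j] for p, emb in zip(probs, embeddings))
        for j in range(dim)
    ]

# Toy embedding table: 3 tokens, 2 dimensions (hypothetical values).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

blended = soft_input([0.5, 0.5, 0.0], E)   # mixes tokens 0 and 1
one_hot = soft_input([0.0, 1.0, 0.0], E)   # recovers token 1's embedding

print(blended, one_hot)
```

A one-hot distribution reduces to an ordinary discrete token embedding, so discrete CoT is the special case in which all mass sits on a single token.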

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds superposition only in from-scratch latent CoT models while others shortcut or collapse, but the claim rests on probes whose reliability is not yet clear.


The main takeaway is that only models trained from scratch on latent chain-of-thought tasks show signs of maintaining multiple candidate solutions in one representation. Training-free constructions and fine-tuned models either lose the superposition or bypass it with shortcuts, which the authors link to pretraining biases that force early token commitment and to limits in model capacity.

That regime comparison is the clearest new element. Earlier latent reasoning papers usually examine one setup in isolation, so seeing the pattern across all three and tying it to known pretraining effects gives a more unified picture than what was available before. The mechanistic story they sketch is straightforward and connects to existing observations about how language models behave in their final layers. The work is useful for anyone trying to decide when continuous representations are likely to deliver extra reasoning power rather than just extra parameters.

The soft spots sit in the measurement. The findings depend on Logit Lens and entity-level probing, yet the abstract supplies no quantitative scores, controls, or checks against known weaknesses in those tools. Logit Lens can surface token-level uncertainty without proving simultaneous encoding of multiple options, and entity probing can recover signals from averaged or sequential states. The stress-test concern about sensitivity holds weight here; without stronger interventions or ablations in the full text, the reported distinction between regimes could partly reflect the analysis method rather than a true difference in internal strategy.

This paper is aimed at researchers working on mechanistic interpretability and latent reasoning methods. A reader already thinking about training regimes for continuous thought would find the capacity and bias points worth testing. I would send it to peer review. The question is timely, the empirical contrast is a reasonable starting point, and referees can push for tighter validation of the probes without the work being fundamentally off track.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes whether language models employ superposition in latent continuous chain-of-thought reasoning. It compares three training regimes—training-free construction of latent thoughts, fine-tuning a base model, and training from scratch—and uses Logit Lens and entity-level probing to argue that only the from-scratch regime shows evidence of superposition, with the others collapsing to single solutions or shortcuts due to pretraining biases and capacity constraints.

Significance. Should the empirical distinctions hold under more rigorous validation, the results would provide a principled account of when and why superposition appears in latent reasoning, with implications for model training strategies and the design of continuous reasoning architectures. The identification of complementary phenomena (pretraining bias and capacity) is a notable contribution if substantiated.

major comments (2)

[Abstract] The abstract asserts specific findings regarding the use of superposition in different regimes but supplies no experimental details, controls, quantitative metrics, or error analysis. This makes it difficult to evaluate the robustness of the claim that only from-scratch models exhibit superposition.
[Analysis of internal representations] The central claim depends on Logit Lens and entity-level probing distinguishing superposition from collapsed or shortcut representations. However, the manuscript does not address documented limitations of Logit Lens in capturing internal computations or validate that entity-level probing can differentiate true simultaneous encoding from averaged or sequential processing. This is load-bearing for the regime comparisons.

minor comments (1)

[Abstract] The phrase 'signs of using superposition' in the abstract and results could be clarified by specifying the exact quantitative criteria or thresholds applied to the probing outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us clarify and strengthen the presentation of our results. We address each major comment point by point below, with revisions made to the manuscript where appropriate to improve robustness and transparency.


Referee: [Abstract] The abstract asserts specific findings regarding the use of superposition in different regimes but supplies no experimental details, controls, quantitative metrics, or error analysis. This makes it difficult to evaluate the robustness of the claim that only from-scratch models exhibit superposition.

Authors: We agree that the abstract is a high-level summary and, as such, omits full experimental details by design. To address this concern, we have revised the abstract to include key quantitative metrics (e.g., average probing accuracies and superposition indices with standard deviations across multiple runs) and a brief mention of controls. Full details on experimental setups, controls, quantitative results, and error analysis remain in Sections 3 and 4, with additional tables in the appendix. This revision maintains brevity while providing readers with sufficient information to assess the claims.

revision: yes

Referee: [Analysis of internal representations] The central claim depends on Logit Lens and entity-level probing distinguishing superposition from collapsed or shortcut representations. However, the manuscript does not address documented limitations of Logit Lens in capturing internal computations or validate that entity-level probing can differentiate true simultaneous encoding from averaged or sequential processing. This is load-bearing for the regime comparisons.

Authors: We acknowledge the documented limitations of Logit Lens (e.g., its tendency to reflect later-layer biases rather than full internal computations) and have added a dedicated discussion subsection (now Section 4.3) citing relevant prior work on these issues, along with how our multi-method approach (combining Logit Lens with probing) mitigates them. For entity-level probing, we have included new validation experiments in Appendix C: we construct synthetic baselines for averaged and sequential representations and show that our probing method yields distinct signatures for true superposition (simultaneous multi-entity activation) versus these alternatives. These additions directly support the validity of our regime comparisons.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical observational study


The paper conducts an empirical analysis across training-free, fine-tuned, and from-scratch regimes, using Logit Lens and entity-level probing to measure internal representations and observe differences in superposition behavior. No derivations, equations, fitted parameters renamed as predictions, or self-citations that bear the load of the central claims are present. All conclusions rest on external experimental measurements and observations rather than reducing to self-definitions or ansatzes by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical investigation that relies on existing probing methods and does not introduce or depend on new mathematical axioms, free parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5546 in / 1173 out tokens · 45954 ms · 2026-05-10T19:56:11.147333+00:00 · methodology
