When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Pith reviewed 2026-05-14 20:16 UTC · model grok-4.3
The pith
LLMs lose track of instructions in multi-turn chats when attention to goal tokens fades, though residual streams may still encode the needed information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose that goal-defining tokens become less accessible through attention while goal-related information persists in residual representations. Using the Goal Accessibility Ratio together with sliding-window ablations and residual-stream probes, we show that the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure across architectures and scales. In a causal ablation on Mistral, force-closing the attention channel collapses recall from near-perfect to 11% on a 20-fact task and raises persona-constraint violations, with both effects appearing at the predicted crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99, while input embeddings remain at chance.
What carries the argument
The Goal Accessibility Ratio (GAR), which measures attention from generated tokens to task-defining goal tokens, combined with sliding-window ablations and linear probes on residual streams.
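The review does not reproduce the exact GAR formula, so the following is a minimal sketch of one plausible reading: mean attention mass from generated-token queries onto goal-token keys, normalized by the mass a uniform attender would place there. The normalization and the averaging over layers, heads, and positions are assumptions, not the paper's definition.

```python
import numpy as np

def goal_accessibility_ratio(attn, goal_idx, gen_idx):
    """Sketch of a GAR-style statistic (not the paper's exact formula).

    attn:     (layers, heads, seq, seq) attention weights; each query
              row sums to 1 over its causal prefix.
    goal_idx: positions of the goal-defining tokens.
    gen_idx:  positions of the generated tokens whose queries we read.

    Returns mean attention mass on goal tokens, normalized by the mass
    a uniform attender would place there (a hypothetical choice).
    """
    attn = np.asarray(attn)
    # Mass that generated-token queries place on goal-token keys,
    # averaged over layers, heads, and query positions.
    goal_mass = attn[:, :, gen_idx, :][:, :, :, goal_idx].sum(-1).mean()
    # Uniform baseline: query position q spreads mass over q + 1 keys.
    uniform = np.mean([len(goal_idx) / (q + 1) for q in gen_idx])
    return float(goal_mass / uniform)

# Toy example: one layer, one head, six tokens, uniform causal attention.
seq = 6
attn = np.zeros((1, 1, seq, seq))
for q in range(seq):
    attn[0, 0, q, : q + 1] = 1.0 / (q + 1)
gar = goal_accessibility_ratio(attn, goal_idx=[0, 1], gen_idx=[4, 5])
# Uniform attention yields GAR == 1 by construction.
```

Under this normalization, GAR drifting below 1 would mean generated tokens attend to the goal less than a uniform attender would, and values near 0 correspond to the closed-channel regime the paper describes.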
If this is right
- When attention to instructions closes, models show qualitatively distinct failure modes depending on architecture.
- Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99, while input embeddings stay at chance.
- Force-closing the attention channel in Mistral collapses recall to 11 percent on a 20-fact retention task.
- The layer at which goal encoding emerges in residuals varies from 2 to 27 across models.
- Failure timing becomes predictable under windowed attention closure.
Where Pith is reading between the lines
- Monitoring GAR during deployment could flag upcoming instruction drift before behavioral failure appears.
- Targeted interventions on residual streams might extend goal retention in models where residuals already encode the information.
- The same attention-versus-residual gap may explain degradation in other long-sequence tasks such as multi-step planning.
- Training methods that strengthen residual goal encoding could reduce reliance on sustained attention.
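The residual-intervention idea above could in principle be tried with activation steering: adding a probe-derived goal direction back into the residual stream so the encoding stays readable as attention decays. The direction, the scale `alpha`, and any behavioral effect are hypothetical; the paper reports no such intervention.

```python
import numpy as np

rng = np.random.default_rng(2)

def steer_toward_goal(resid, probe_dir, alpha=4.0):
    """Add alpha units of a probe-derived goal direction to every
    residual vector. probe_dir and alpha are assumptions for this
    sketch, not quantities reported in the paper."""
    u = probe_dir / np.linalg.norm(probe_dir)
    return resid + alpha * u

# Toy check: a hypothetical probe direction in a 32-dim stream.
probe_dir = rng.normal(size=32)
resid = rng.normal(size=(4, 32))
steered = steer_toward_goal(resid, probe_dir)

unit = probe_dir / np.linalg.norm(probe_dir)
# The readout along the goal direction rises by exactly alpha.
assert np.allclose(steered @ unit - resid @ unit, 4.0)
```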
Load-bearing premise
Linear probes on residual streams recover causally relevant goal information rather than spurious correlations, and sliding-window ablations isolate the attention channel without confounding other mechanisms.
What would settle it
An experiment that keeps attention to goal tokens high while disrupting the residual goal encoding (or vice versa) and checks whether behavior follows the residual state or the attention state.
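One half of this proposed experiment, disrupting the residual encoding while leaving attention untouched, is commonly implemented as a directional ablation: project the probe-derived direction out of the residual stream at every position. A minimal sketch, with the direction hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def ablate_direction(resid, probe_dir):
    """Project a probe-derived goal direction out of residual vectors:
    x' = x - (x . u) u with u = probe_dir / ||probe_dir||. Attention
    weights are untouched, so behavior after this hook tells us whether
    generation tracks the residual state or the attention state."""
    u = probe_dir / np.linalg.norm(probe_dir)
    return resid - np.outer(resid @ u, u)

# Toy check in a 16-dim stream over 8 positions; in practice the
# direction would come from the trained probe.
probe_dir = rng.normal(size=16)
resid = rng.normal(size=(8, 16))
ablated = ablate_direction(resid, probe_dir)

unit = probe_dir / np.linalg.norm(probe_dir)
# The goal-direction readout is zeroed; orthogonal components survive.
assert np.allclose(ablated @ unit, 0.0)
```

Running the model with this edit while GAR stays high, and the converse (attention masked, residuals intact), would separate the two channels' causal contributions.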
Original abstract
Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a channel-transition account for how LLMs lose goal information during multi-turn interactions. It defines the Goal Accessibility Ratio (GAR) from attention weights to goal tokens, combines it with sliding-window ablations and linear probes on residual streams, and claims that the gap between attention loss and residual decodability predicts survival of goal-conditioned behavior across architectures. A causal ablation on Mistral shows recall collapsing to 11% under forced attention closure, with probes achieving AUC up to 0.99 on residuals versus chance on embeddings.
Significance. If the results hold, the work supplies a mechanistic framework distinguishing attention and residual channels for context degradation, with GAR as a practical diagnostic and architecture-specific predictions of failure timing. The within-model causal ablation and cross-architecture patterns are notable strengths that could inform context-management methods.
Major comments (2)
- [Residual-stream probes (abstract and methods)] The linear probes achieve AUC up to 0.99 for predicting per-episode recall from residuals, yet the manuscript reports only post-hoc classification accuracy without activation patching, causal tracing, or direction ablation to test whether perturbing the probed residual directions alters goal-conditioned generation under closed attention. This is load-bearing for the central claim that the attention-residual gap predicts behavioral survival.
- [Sliding-window ablations] The sliding-window ablations are presented as isolating the attention channel, but potential residual effects on KV-cache statistics or positional signals are not addressed or controlled, which risks confounding the attribution of distinct failure modes specifically to the attention/residual distinction.
Minor comments (2)
- [Abstract and results] The abstract and results sections would benefit from explicit reporting of error bars, exact data splits, and statistical tests for the AUC values and ablation outcomes to support verification.
- [GAR definition] Clarify the precise formula and hyperparameters for GAR computation, including window size and attention aggregation, in the main text or an early methods subsection.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important points about causal validation and experimental controls. We address each major comment below and have made partial revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Residual-stream probes (abstract and methods)] The linear probes achieve AUC up to 0.99 for predicting per-episode recall from residuals, yet the manuscript reports only post-hoc classification accuracy without activation patching, causal tracing, or direction ablation to test whether perturbing the probed residual directions alters goal-conditioned generation under closed attention. This is load-bearing for the central claim that the attention-residual gap predicts behavioral survival.
Authors: We appreciate this observation. The manuscript's central causal evidence is the direct within-model attention-closure intervention on Mistral, which forces the attention channel closed and collapses recall from near-perfect to 11% while elevating persona violations above the adversarial baseline. This intervention demonstrates that goal-conditioned behavior depends on the attention channel even when residuals retain decodable goal information (probes reach AUC 0.99 while embeddings remain at chance). The probes themselves are presented as providing predictive evidence of residual retention rather than as a standalone causal demonstration. We agree that activation patching or direction ablation on the probed residual directions would provide stronger causal support for the residual channel's role. We will revise the methods and discussion sections to explicitly frame the probes as correlational/predictive, clarify that the attention-closure ablation supplies the primary causal test of the gap, and add a limitations paragraph noting the absence of residual-direction interventions as an avenue for future work. revision: partial
Referee: [Sliding-window ablations] The sliding-window ablations are presented as isolating the attention channel, but potential residual effects on KV-cache statistics or positional signals are not addressed or controlled, which risks confounding the attribution of distinct failure modes specifically to the attention/residual distinction.
Authors: We acknowledge the potential for residual confounds in the sliding-window setup. The sliding-window ablations were intended to simulate progressive attention closure, but KV-cache statistics and positional encodings could indeed carry over information. Our strongest causal result, however, comes from the separate direct attention-forcing ablation on Mistral rather than from the sliding-window results alone. We will revise the methods section to explicitly discuss these possible confounds, add controls (e.g., verification that positional encodings remain intact and that KV-cache content is not artificially altered beyond attention masking), and clarify that cross-architecture patterns are interpreted in light of the direct causal intervention. This will better isolate the attention/residual distinction. revision: partial
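The attention-only intervention the authors describe can be expressed as an additive attention mask that makes goal tokens unattendable from a chosen crossover turn onward, while KV-cache contents and positional indices are left intact. The layout below is a sketch; the window size and crossover point are assumptions.

```python
import numpy as np

def windowed_closure_mask(seq_len, goal_idx, close_from):
    """Additive attention mask (0 = attend, -inf = blocked): causal
    everywhere, with goal tokens additionally blocked for every query
    position >= close_from. KV-cache entries and positions are not
    modified; only the attention scores change."""
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # causal part
    for q in range(close_from, seq_len):
        mask[q, goal_idx] = -np.inf  # goal tokens invisible after crossover
    return mask

# Toy crossover: goal tokens 0-1 close at query position 5 of 8.
mask = windowed_closure_mask(seq_len=8, goal_idx=[0, 1], close_from=5)
assert mask[4, 0] == 0.0     # before crossover: still attendable
assert np.isinf(mask[6, 1])  # after crossover: force-closed
```

Adding this mask to the pre-softmax scores drives attention to the goal tokens to exactly zero after the crossover, which is the force-closure regime of the Mistral ablation.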
Circularity Check
No significant circularity in the channel-transition account
Full rationale
The paper defines GAR directly from attention weights to goal tokens and trains linear probes to classify per-episode recall outcomes from residual streams, reporting high AUC. The central claim, that the gap between attention loss and residual decodability predicts survival of goal-conditioned behavior, is an empirical cross-architecture correlation supported by independent behavioral measurements and a causal ablation (force-closing attention in Mistral, dropping recall to 11%). No step reduces a prediction to its inputs by construction: the probe AUC measures correlation with outcomes but does not mathematically force the survival rates or the gap's predictive power, which are evaluated separately across models and via the ablation. The chain of evidence is checked against independent behavioral measurements rather than being self-referential.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: attention weights directly measure the accessibility of goal tokens to later generations.
- Domain assumption: residual-stream representations can be linearly decoded for goal information without the attention channel.
Invented entities (2)
- Goal Accessibility Ratio (GAR): no independent evidence
- channel-transition account: no independent evidence