When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Pith reviewed 2026-05-14 20:16 UTC · model grok-4.3
The pith
LLMs lose track of instructions in multi-turn chats when attention to goal tokens fades, though residual streams may still encode the needed information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose that goal-defining tokens become less accessible through attention while goal-related information persists in residual representations. Using the Goal Accessibility Ratio together with sliding-window ablations and residual-stream probes, we show that the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure across architectures and scales. In a causal ablation on Mistral, force-closing the attention channel collapses recall from near-perfect to 11% on a 20-fact task and raises persona-constraint violations, with both effects appearing at the predicted crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99, while input embeddings remain at chance.
What carries the argument
The Goal Accessibility Ratio (GAR), which measures attention from generated tokens to task-defining goal tokens, combined with sliding-window ablations and linear probes on residual streams.
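The review does not reproduce the exact GAR formula, so the following is a minimal sketch of one plausible reading: mean attention mass from generated-token queries onto goal-token keys, normalized by the mass a uniform attender would place there. The normalization and the averaging over layers, heads, and positions are assumptions, not the paper's definition.

```python
import numpy as np

def goal_accessibility_ratio(attn, goal_idx, gen_idx):
    """Sketch of a GAR-style statistic (not the paper's exact formula).

    attn:     (layers, heads, seq, seq) attention weights; each query
              row sums to 1 over its causal prefix.
    goal_idx: positions of the goal-defining tokens.
    gen_idx:  positions of the generated tokens whose queries we read.

    Returns mean attention mass on goal tokens, normalized by the mass
    a uniform attender would place there (a hypothetical choice).
    """
    attn = np.asarray(attn)
    # Mass that generated-token queries place on goal-token keys,
    # averaged over layers, heads, and query positions.
    goal_mass = attn[:, :, gen_idx, :][:, :, :, goal_idx].sum(-1).mean()
    # Uniform baseline: query position q spreads mass over q + 1 keys.
    uniform = np.mean([len(goal_idx) / (q + 1) for q in gen_idx])
    return float(goal_mass / uniform)

# Toy example: one layer, one head, six tokens, uniform causal attention.
seq = 6
attn = np.zeros((1, 1, seq, seq))
for q in range(seq):
    attn[0, 0, q, : q + 1] = 1.0 / (q + 1)
gar = goal_accessibility_ratio(attn, goal_idx=[0, 1], gen_idx=[4, 5])
# Uniform attention yields GAR == 1 by construction.
```

Under this normalization, GAR drifting below 1 would mean generated tokens attend to the goal less than a uniform attender would, and values near 0 correspond to the closed-channel regime the paper describes.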
If this is right
- When attention to instructions closes, models show qualitatively distinct failure modes depending on architecture.
- Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99, while input embeddings stay at chance.
- Force-closing the attention channel in Mistral collapses recall to 11 percent on a 20-fact retention task.
- The layer at which goal encoding emerges in residuals varies from 2 to 27 across models.
- Failure timing becomes predictable under windowed attention closure.
Where Pith is reading between the lines
- Monitoring GAR during deployment could flag upcoming instruction drift before behavioral failure appears.
- Targeted interventions on residual streams might extend goal retention in models where residuals already encode the information.
- The same attention-versus-residual gap may explain degradation in other long-sequence tasks such as multi-step planning.
- Training methods that strengthen residual goal encoding could reduce reliance on sustained attention.
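The residual-intervention idea above could in principle be tried with activation steering: adding a probe-derived goal direction back into the residual stream so the encoding stays readable as attention decays. The direction, the scale `alpha`, and any behavioral effect are hypothetical; the paper reports no such intervention.

```python
import numpy as np

rng = np.random.default_rng(2)

def steer_toward_goal(resid, probe_dir, alpha=4.0):
    """Add alpha units of a probe-derived goal direction to every
    residual vector. probe_dir and alpha are assumptions for this
    sketch, not quantities reported in the paper."""
    u = probe_dir / np.linalg.norm(probe_dir)
    return resid + alpha * u

# Toy check: a hypothetical probe direction in a 32-dim stream.
probe_dir = rng.normal(size=32)
resid = rng.normal(size=(4, 32))
steered = steer_toward_goal(resid, probe_dir)

unit = probe_dir / np.linalg.norm(probe_dir)
# The readout along the goal direction rises by exactly alpha.
assert np.allclose(steered @ unit - resid @ unit, 4.0)
```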
Load-bearing premise
Linear probes on residual streams recover causally relevant goal information rather than spurious correlations, and sliding-window ablations isolate the attention channel without confounding other mechanisms.
What would settle it
An experiment that keeps attention to goal tokens high while disrupting the residual goal encoding (or vice versa) and checks whether behavior follows the residual state or the attention state.
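One half of this proposed experiment, disrupting the residual encoding while leaving attention untouched, is commonly implemented as a directional ablation: project the probe-derived direction out of the residual stream at every position. A minimal sketch, with the direction hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def ablate_direction(resid, probe_dir):
    """Project a probe-derived goal direction out of residual vectors:
    x' = x - (x . u) u with u = probe_dir / ||probe_dir||. Attention
    weights are untouched, so behavior after this hook tells us whether
    generation tracks the residual state or the attention state."""
    u = probe_dir / np.linalg.norm(probe_dir)
    return resid - np.outer(resid @ u, u)

# Toy check in a 16-dim stream over 8 positions; in practice the
# direction would come from the trained probe.
probe_dir = rng.normal(size=16)
resid = rng.normal(size=(8, 16))
ablated = ablate_direction(resid, probe_dir)

unit = probe_dir / np.linalg.norm(probe_dir)
# The goal-direction readout is zeroed; orthogonal components survive.
assert np.allclose(ablated @ unit, 0.0)
```

Running the model with this edit while GAR stays high, and the converse (attention masked, residuals intact), would separate the two channels' causal contributions.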
Original abstract
Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a channel-transition account for how LLMs lose goal information during multi-turn interactions. It defines the Goal Accessibility Ratio (GAR) from attention weights to goal tokens, combines it with sliding-window ablations and linear probes on residual streams, and claims that the gap between attention loss and residual decodability predicts survival of goal-conditioned behavior across architectures. A causal ablation on Mistral shows recall collapsing to 11% under forced attention closure, with probes achieving AUC up to 0.99 on residuals versus chance on embeddings.
Significance. If the results hold, the work supplies a mechanistic framework distinguishing attention and residual channels for context degradation, with GAR as a practical diagnostic and architecture-specific predictions of failure timing. The within-model causal ablation and cross-architecture patterns are notable strengths that could inform context-management methods.
Major comments (2)
- [Residual-stream probes (abstract and methods)] The linear probes achieve AUC up to 0.99 for predicting per-episode recall from residuals, yet the manuscript reports only post-hoc classification accuracy without activation patching, causal tracing, or direction ablation to test whether perturbing the probed residual directions alters goal-conditioned generation under closed attention. This is load-bearing for the central claim that the attention-residual gap predicts behavioral survival.
- [Sliding-window ablations] The sliding-window ablations are presented as isolating the attention channel, but potential residual effects on KV-cache statistics or positional signals are not addressed or controlled, which risks confounding the attribution of distinct failure modes specifically to the attention/residual distinction.
Minor comments (2)
- [Abstract and results] The abstract and results sections would benefit from explicit reporting of error bars, exact data splits, and statistical tests for the AUC values and ablation outcomes to support verification.
- [GAR definition] Clarify the precise formula and hyperparameters for GAR computation, including window size and attention aggregation, in the main text or an early methods subsection.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important points about causal validation and experimental controls. We address each major comment below and have made partial revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Residual-stream probes (abstract and methods)] The linear probes achieve AUC up to 0.99 for predicting per-episode recall from residuals, yet the manuscript reports only post-hoc classification accuracy without activation patching, causal tracing, or direction ablation to test whether perturbing the probed residual directions alters goal-conditioned generation under closed attention. This is load-bearing for the central claim that the attention-residual gap predicts behavioral survival.
Authors: We appreciate this observation. The manuscript's central causal evidence is the direct within-model attention-closure intervention on Mistral, which forces the attention channel closed and collapses recall from near-perfect to 11% while elevating persona violations above the adversarial baseline. This intervention demonstrates that goal-conditioned behavior depends on the attention channel even when residuals retain decodable goal information (probes reach AUC 0.99 while embeddings remain at chance). The probes themselves are presented as providing predictive evidence of residual retention rather than as a standalone causal demonstration. We agree that activation patching or direction ablation on the probed residual directions would provide stronger causal support for the residual channel's role. We will revise the methods and discussion sections to explicitly frame the probes as correlational/predictive, clarify that the attention-closure ablation supplies the primary causal test of the gap, and add a limitations paragraph noting the absence of residual-direction interventions as an avenue for future work. revision: partial
Referee: [Sliding-window ablations] The sliding-window ablations are presented as isolating the attention channel, but potential residual effects on KV-cache statistics or positional signals are not addressed or controlled, which risks confounding the attribution of distinct failure modes specifically to the attention/residual distinction.
Authors: We acknowledge the potential for residual confounds in the sliding-window setup. The sliding-window ablations were intended to simulate progressive attention closure, but KV-cache statistics and positional encodings could indeed carry over information. Our strongest causal result, however, comes from the separate direct attention-forcing ablation on Mistral rather than from the sliding-window results alone. We will revise the methods section to explicitly discuss these possible confounds, add controls (e.g., verification that positional encodings remain intact and that KV-cache content is not artificially altered beyond attention masking), and clarify that cross-architecture patterns are interpreted in light of the direct causal intervention. This will better isolate the attention/residual distinction. revision: partial
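The attention-only intervention the authors describe can be expressed as an additive attention mask that makes goal tokens unattendable from a chosen crossover turn onward, while KV-cache contents and positional indices are left intact. The layout below is a sketch; the window size and crossover point are assumptions.

```python
import numpy as np

def windowed_closure_mask(seq_len, goal_idx, close_from):
    """Additive attention mask (0 = attend, -inf = blocked): causal
    everywhere, with goal tokens additionally blocked for every query
    position >= close_from. KV-cache entries and positions are not
    modified; only the attention scores change."""
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # causal part
    for q in range(close_from, seq_len):
        mask[q, goal_idx] = -np.inf  # goal tokens invisible after crossover
    return mask

# Toy crossover: goal tokens 0-1 close at query position 5 of 8.
mask = windowed_closure_mask(seq_len=8, goal_idx=[0, 1], close_from=5)
assert mask[4, 0] == 0.0     # before crossover: still attendable
assert np.isinf(mask[6, 1])  # after crossover: force-closed
```

Adding this mask to the pre-softmax scores drives attention to the goal tokens to exactly zero after the crossover, which is the force-closure regime of the Mistral ablation.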
Circularity Check
No significant circularity in the channel-transition account
Full rationale
The paper defines GAR directly from attention weights to goal tokens and trains linear probes to classify per-episode recall outcomes from residual streams, reporting high AUC. The central claim, that the gap between attention loss and residual decodability predicts survival of goal-conditioned behavior, is an empirical cross-architecture correlation supported by independent behavioral measurements and a causal ablation (force-closing attention in Mistral, dropping recall to 11%). No step reduces a prediction to its inputs by construction: the probe AUC measures correlation with outcomes but does not mathematically force the survival rates or the gap's predictive power, which are evaluated separately across models and via the ablation. The chain of evidence is checked against independent behavioral measurements rather than being self-referential.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: attention weights directly measure the accessibility of goal tokens to later generations.
- Domain assumption: residual-stream representations can be linearly decoded for goal information without the attention channel.
Invented entities (2)
- Goal Accessibility Ratio (GAR): no independent evidence
- channel-transition account: no independent evidence