pith. machine review for the scientific record.

arxiv: 2604.23371 · v1 · submitted 2026-04-25 · 💻 cs.LG

Recognition: unknown

When Context Sticks: Studying Interference in In-Context Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords: in-context learning · interference · context stickiness · curriculum learning · synthetic regression · transformer adaptation · prompt interference · task switching

The pith

Earlier examples in a prompt continue to interfere with a transformer's adaptation to later tasks during in-context learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates how earlier examples in a prompt can "stick," biasing predictions even after new examples arrive for a different task. By training transformers on regression problems over linear and quadratic functions under varied curricula, the authors measure how error changes as the numbers of misleading and corrective examples vary. They find that interference persists reliably, with degradation scaling with the count of prior examples, and that the training curriculum strongly influences how quickly the model overcomes it. Readers might care because in-context learning powers quick adaptation in large models, making any stickiness a potential limit on their flexibility with mixed or changing contexts.
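As a concrete illustration of the protocol, the following is a minimal sketch of the switch experiment, with an ordinary least-squares fit over [x, x^2] features standing in for the trained transformer; the function names, ranges, and sweep grid are illustrative assumptions, not the paper's code.

    # Toy sketch of the linear-to-quadratic switch protocol described
    # above. A least-squares fit stands in for the trained transformer;
    # names and constants are illustrative, not the paper's code.
    import numpy as np

    rng = np.random.default_rng(0)

    def build_prompt(n_lin, n_quad, w_lin, w_quad):
        """n_lin misleading linear examples followed by n_quad
        corrective quadratic ones, mirroring the abrupt task switch."""
        x_l = rng.uniform(-1, 1, n_lin)
        x_q = rng.uniform(-1, 1, n_quad)
        xs = np.concatenate([x_l, x_q])
        ys = np.concatenate([w_lin * x_l, w_quad * x_q ** 2])
        return xs, ys

    def query_error(n_lin, n_quad, trials=500):
        """Mean squared error on a held-out quadratic query."""
        errs = []
        for _ in range(trials):
            w_lin, w_quad = rng.normal(size=2)
            xs, ys = build_prompt(n_lin, n_quad, w_lin, w_quad)
            feats = np.stack([xs, xs ** 2], axis=1)
            coef, *_ = np.linalg.lstsq(feats, ys, rcond=None)
            x_star = rng.uniform(-1, 1)
            pred = coef @ np.array([x_star, x_star ** 2])
            errs.append((pred - w_quad * x_star ** 2) ** 2)
        return float(np.mean(errs))

    # Sweep the grid: stickiness along n_lin, recovery along n_quad.
    for n_lin in (0, 4, 8):
        print(n_lin, [round(query_error(n_lin, n_q), 3)
                      for n_q in (2, 4, 8, 16)])

Even this stand-in is biased mechanically toward the linear prefix; the paper's question is whether a trained transformer exhibits the same error geometry and how training curricula reshape it.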

Core claim

Using controlled sweeps of linear-then-quadratic prompts, the study demonstrates three effects: more initial linear examples increase error on quadratic predictions; additional quadratic examples decrease it, with diminishing returns; and sequential training on the target function class enables the fastest recovery, while random training yields the weakest resilience to interference.
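One way to make the swept quantity precise (Pith's notation, an editorial assumption rather than the paper's own): write the expected squared error on a quadratic query as a function of prompt composition.

    % Expected squared error on a quadratic query x_star after a prompt
    % C containing n_l misleading linear examples followed by n_q
    % corrective quadratic examples, for a trained model \hat{y}_\theta.
    E(n_l, n_q) = \mathbb{E}\Big[ \big( \hat{y}_\theta(x_\star \mid C_{n_l, n_q}) - f_{\mathrm{quad}}(x_\star) \big)^2 \Big]

In these terms, stickiness is the claim that E increases in n_l at fixed n_q; diminishing returns, that the marginal drop E(n_l, n_q) - E(n_l, n_q + 1) shrinks as n_q grows; and the curriculum effect, that the decay rate of E in n_q depends on the training regime.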

What carries the argument

Persistent interference from preceding context, measured as the degradation in prediction accuracy when switching between linear and quadratic regression tasks in the prompt.

If this is right

  • More preceding examples from one function class will increase error when predicting the other class.
  • Error reduction from corrective examples slows after the first few additions.
  • Sequential training curricula produce models that recover most quickly from context interference.
  • Random training curricula result in models with the poorest robustness to task switches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Prompt design in practice should consider ordering to reduce the impact of earlier examples on later ones.
  • These dynamics might extend to natural language tasks, suggesting that long context windows could accumulate unwanted biases.
  • Alternative training methods could be explored to enhance resilience beyond the tested curricula.

Load-bearing premise

That results from these synthetic regression tasks with linear and quadratic functions generalize to interference effects in real-world in-context learning with language models.

What would settle it

Finding that the number of preceding examples has no systematic effect on prediction error, or that all curricula show identical recovery rates, when tested on the same task switches or on actual language modeling prompts.

Figures

Figures reproduced from arXiv: 2604.23371 by Dagny Streit, Hanna Rød, Justin Li, Nils Valseth Selte.

Figure 1: Overall 3D error surfaces
Figure 2: Recovery curves, holding the number of preceding linear examples constant
Figure 3: Stickiness curves, holding the number of following quadratic examples constant
Figure 4: Sequential curriculum, error versus number of quadratic examples
Figure 5: Sequential curriculum, error when switching between tasks
Figure 6: Mixed and random curricula, error when switching between tasks
Original abstract

This paper investigates context stickiness in in-context learning (ICL), a phenomenon where earlier examples in a prompt interfere with a transformer's ability to adapt to later tasks. Using synthetic regression tasks over linear and quadratic functions, we examine how models trained under sequential, mixed, and random curricula handle abrupt task switches during inference. By sweeping over structured combinations of misleading linear examples followed by recovery quadratic examples, we quantify how prior context biases prediction error and how quickly models realign. Our results show strong evidence of persistent interference: more preceding linear examples reliably degrade quadratic predictions, while additional quadratic examples reduce error but with diminishing returns. We further find that training curricula significantly modulate resilience, with sequential training on the target function class yielding the fastest recovery, and surprisingly, random training producing the least robust behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study of context stickiness in in-context learning (ICL) using transformers trained on synthetic linear and quadratic regression tasks under sequential, mixed, and random curricula. It examines performance on inference prompts with abrupt switches from misleading linear examples to target quadratic examples, claiming to show persistent interference where additional preceding linear examples degrade quadratic predictions, with diminishing returns from recovery examples, and curricula modulating adaptation speed (sequential best, random worst).

Significance. The controlled synthetic experiments offer a clean way to isolate and quantify interference effects in ICL, which could help explain transformer adaptation mechanisms if the patterns prove robust. The curriculum comparisons are a useful angle for training design. However, the absence of any scaling or natural-language validation substantially limits the significance for understanding ICL in actual large language models.

major comments (3)
  1. §3 (Experimental Setup): The description of model training, data generation, and evaluation lacks key reproducibility details, including transformer architecture (layers, heads, embedding size), optimization hyperparameters, exact prompt lengths, number of independent seeds/runs, and how error is aggregated. Without these, the 'strong evidence' of interference cannot be verified or reproduced.
  2. §4 (Results): Figures and tables reporting degradation with more linear examples and recovery curves do not include error bars, standard deviations, or any statistical significance tests. This undermines the reliability of claims such as 'reliably degrade' and 'diminishing returns', since variance in synthetic regression could explain the trends (a toy sketch of such a test follows this list).
  3. §5 (Discussion) and abstract: The central claim that the work studies interference 'in in-context learning' and provides evidence relevant to transformers/LLMs rests on the untested assumption that linear/quadratic synthetic tasks with artificial switches capture the interference dynamics of high-dimensional, semantically structured natural-language ICL. No bridging experiments, scaling studies, or comparisons to real LLM prompts are provided.
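For concreteness, the seed-level significance test requested in comment 2 might look like the sketch below; the data layout and numbers are assumptions for illustration, not the authors' analysis.

    # Toy sketch of a paired test across seeds (assumed layout:
    # errors[seed, condition], conditions = n_lin = 0, 4, 8).
    # Illustrative stand-in data, not the authors' analysis.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    errors = rng.normal(loc=[0.20, 0.32, 0.45], scale=0.02, size=(5, 3))

    # Paired t-test: does a longer linear prefix raise quadratic error?
    t, p = stats.ttest_rel(errors[:, 2], errors[:, 0])
    print(f"t = {t:.2f}, p = {p:.4f}")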
minor comments (2)
  1. Introduction: 'Context stickiness' is introduced only informally; a short formal definition or equation quantifying the interference (e.g., error as a function of prefix length) would improve clarity.
  2. Abstract and §4: The 'surprisingly' qualifier on the random-curriculum result is not supported by a direct comparison figure or table reference, making the claim of surprise harder to evaluate.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments have prompted us to improve the reproducibility and statistical rigor of the manuscript. We address each major comment below and have made corresponding revisions.

Point-by-point responses
  1. Referee: §3 (Experimental Setup): The description of model training, data generation, and evaluation lacks key reproducibility details, including transformer architecture (layers, heads, embedding size), optimization hyperparameters, exact prompt lengths, number of independent seeds/runs, and how error is aggregated. Without these, the 'strong evidence' of interference cannot be verified or reproduced.

    Authors: We agree that these details were insufficient in the original submission. The revised manuscript expands Section 3 and adds Appendix A with full specifications: a 4-layer transformer with 8 attention heads and embedding size 256; Adam optimizer with learning rate 1e-4, batch size 64, and 50k training steps; prompts consisting of 10-20 examples (approximately 200-400 tokens); results aggregated as mean over 5 independent random seeds with standard deviation; and data generation procedures for linear/quadratic functions (these settings are collected into a configuration sketch after the responses below). These additions enable full reproduction of all experiments. revision: yes

  2. Referee: §4 (Results): Figures and tables reporting degradation with more linear examples and recovery curves do not include error bars, standard deviations, or any statistical significance tests. This undermines the reliability of claims such as 'reliably degrade' and 'diminishing returns', since variance in synthetic regression could explain the trends.

    Authors: We acknowledge the omission of variability measures. All figures in the revised Section 4 now display error bars as mean ± one standard deviation across the 5 seeds. We have added a statistical analysis subsection reporting paired t-tests comparing conditions with varying numbers of linear examples (all p < 0.01 for the reported degradations) and confirming diminishing returns in recovery. The text has been updated to reference these statistics when stating trends. revision: yes

  3. Referee: §5 (Discussion) and abstract: The central claim that the work studies interference 'in in-context learning' and provides evidence relevant to transformers/LLMs rests on the untested assumption that linear/quadratic synthetic tasks with artificial switches capture the interference dynamics of high-dimensional, semantically structured natural-language ICL. No bridging experiments, scaling studies, or comparisons to real LLM prompts are provided.

    Authors: We agree that the synthetic setting does not automatically generalize to natural-language ICL and have revised the abstract and Section 5 to explicitly frame the work as a controlled study of interference mechanisms rather than a direct claim about LLMs. The discussion now includes a dedicated limitations paragraph acknowledging the gap and outlining why synthetic tasks enable isolation of effects not feasible in high-dimensional language data. We have not added new scaling or LLM experiments, as they fall outside the paper's scope of providing mechanistic insights via precise synthetic controls. revision: partial
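Collecting the specifications stated in response 1 into one place, a hypothetical configuration sketch (transcribed from the rebuttal text, not the authors' released code):

    # Settings as stated in the rebuttal's response 1; a hypothetical
    # sketch, not the authors' released configuration.
    config = {
        "n_layers": 4,
        "n_heads": 8,
        "d_embed": 256,
        "optimizer": "adam",
        "learning_rate": 1e-4,
        "batch_size": 64,
        "train_steps": 50_000,
        "prompt_examples": (10, 20),  # roughly 200-400 tokens
        "n_seeds": 5,
    }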

Circularity Check

0 steps flagged

No circularity: empirical results from synthetic experiments

Full rationale

This is a purely empirical paper that reports controlled experiments on synthetic linear/quadratic regression tasks with different training curricula and abrupt task switches. No mathematical derivation chain, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methods. The central observations (persistent interference, curriculum effects) are direct measurements from the experimental setup rather than reductions of outputs to inputs by construction. The study is self-contained and does not lean on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that synthetic linear/quadratic regression tasks model relevant ICL dynamics and that prediction error differences reflect interference rather than other factors.

axioms (1)
  • Domain assumption: Synthetic regression tasks over linear and quadratic functions capture key aspects of interference in transformer in-context learning.
    The entire experimental design and interpretation of results depend on this assumption about task representativeness.
invented entities (1)
  • Context stickiness (no independent evidence)
    purpose: To name and frame the observed persistent interference from earlier prompt examples.
    The term is introduced to describe the main phenomenon without independent evidence outside the experiments.

pith-pipeline@v0.9.0 · 5437 in / 1335 out tokens · 41401 ms · 2026-05-08T08:28:03.757936+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 24 canonical work pages · 6 internal anchors

  1. Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. "What Learning Algorithm Is In-Context Learning? Investigations with Linear Models". 2023. arXiv: 2211.15661 [cs.LG]. URL: https://arxiv.org/abs/2211.15661
  2. Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, and Prateek Kolhar. "Revisiting In-Context Learning with Long Context Language Models". 2025. arXiv: 2412.16926 [cs.CL]. URL: https://arxiv.org/abs/2412.16926
  3. Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. "Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection". 2023. arXiv: 2306.04637 [cs.LG]. URL: https://arxiv.org/abs/2306.04637
  4. Ali Behrouz, Meisam Razaviyayn, Peiling Zhong, and Vahab Mirrokni. "Nested Learning: The Illusion of Deep Learning Architectures". NeurIPS 2025. URL: https://openreview.net/pdf?id=nbMeRvNb7A
  5. Amanda Bertsch, Maor Ivgi, Emily Xiao, Uri Alon, Jonathan Berant, Matthew R. Gormley, and Graham Neubig. "In-Context Learning with Long-Context Models: An In-Depth Exploration". 2025. arXiv: 2405.00200 [cs.CL]. URL: https://arxiv.org/abs/2405.00200
  6. Harmon Bhasin, Timothy Ossowski, Yiqiao Zhong, and Junjie Hu. "How does Multi-Task Training Affect Transformer In-Context Capabilities? Investigations with Function Classes". 2024. arXiv: 2404.03558 [cs.CL]. URL: https://arxiv.org/abs/2404.03558
  7. Tom B. Brown et al. "Language Models are Few-Shot Learners". 2020. arXiv: 2005.14165 [cs.CL]. URL: https://arxiv.org/abs/2005.14165
  8. Eric Nuertey Coleman, Julio Hurtado, and Vincenzo Lomonaco. "In-context Interference in Chat-based Large Language Models". 2023. arXiv: 2309.12727 [cs.AI]. URL: https://arxiv.org/abs/2309.12727
  9. Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers". 2023. arXiv: 2212.10559 [cs.CL]. URL: https://arxiv.org/abs/2212.10559
  10. Qingxiu Dong et al. "A Survey on In-context Learning". 2024. arXiv: 2301.00234 [cs.CL]. URL: https://arxiv.org/abs/2301.00234
  11. Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes". 2023. arXiv: 2208.01066 [cs.CL]. URL: https://arxiv.org/abs/2208.01066
  12. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. "Scaling Laws for Neural Language Models". 2020. arXiv: 2001.08361 [cs.LG]. URL: https://arxiv.org/abs/2001.08361
  13. Jaeyeon Kim, Sehyun Kwon, Joo Young Choi, Jongho Park, Jaewoong Cho, Jason D. Lee, and Ernest K. Ryu. "Task Diversity Shortens the ICL Plateau". 2025. arXiv: 2410.05448 [cs.LG]. URL: https://arxiv.org/abs/2410.05448
  14. Warren Li, Yiqian Wang, Zihan Wang, and Jingbo Shang. "Order Matters: Rethinking Prompt Construction in In-Context Learning". 2025. arXiv: 2511.09700 [cs.CL]. URL: https://arxiv.org/abs/2511.09700
  15. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. "Lost in the Middle: How Language Models Use Long Contexts". 2023. arXiv: 2307.03172 [cs.CL]. URL: https://arxiv.org/abs/2307.03172
  16. Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. "Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity". 2022. arXiv: 2104.08786 [cs.CL]. URL: https://arxiv.org/abs/2104.08786
  17. Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?" 2022. arXiv: 2202.12837 [cs.CL]. URL: https://arxiv.org/abs/2202.12837
  18. Catherine Olsson et al. "In-context Learning and Induction Heads". 2022. arXiv: 2209.11895 [cs.LG]. URL: https://arxiv.org/abs/2209.11895
  19. Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. "Transformers Learn In-Context by Gradient Descent". 2023. arXiv: 2212.07677 [cs.LG]. URL: https://arxiv.org/abs/2212.07677
  20. Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. "What In-Context Learning 'Learns' In-Context: Disentangling Task Recognition and Task Learning". 2023. arXiv: 2305.09731 [cs.CL]. URL: https://arxiv.org/abs/2305.09731
  21. Allan Raventós, Mansheej Paul, Feng Chen, and Surya Ganguli. "Pretraining Task Diversity and the Emergence of Non-Bayesian In-Context Learning for Regression". 2023. arXiv: 2306.15063 [cs.LG]. URL: https://arxiv.org/abs/2306.15063
  22. Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. "Function Vectors in Large Language Models". 2024. arXiv: 2310.15213 [cs.CL]. URL: https://arxiv.org/abs/2310.15213
  23. Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. "An Explanation of In-context Learning as Implicit Bayesian Inference". 2022. arXiv: 2111.02080 [cs.CL]. URL: https://arxiv.org/abs/2111.02080
  24. Kayo Yin and Jacob Steinhardt. "Which Attention Heads Matter for In-Context Learning?" 2025. arXiv: 2502.14010 [cs.LG]. URL: https://arxiv.org/abs/2502.14010
  25. Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. "Calibrate Before Use: Improving Few-Shot Performance of Language Models". 2021. arXiv: 2102.09690 [cs.CL]. URL: https://arxiv.org/abs/2102.09690