pith. sign in

arxiv: 2602.23197 · v2 · pith:RPRGYRJ7new · submitted 2026-02-26 · 💻 cs.CL · cs.LG· stat.ML

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

classification 💻 cs.CL cs.LGstat.ML
keywords in-contextlearningfine-tuningmodelstasksattentionperformancefew-shot
0
0 comments X
read the original abstract

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We provide empirical evidence from synthetic and real-world datasets consistent with the qualitative predictions of our theory.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates

    cs.LG 2026-05 unverdicted novelty 5.0

    FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.