Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions

Samet Demir; Zafer Dogan

arxiv: 2511.01292 · v2 · submitted 2025-11-03 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-18 01:42 UTC · model grok-4.3

keywords: in-context learning · attention temperature · distribution shift · generalization error · high-dimensional regression · transformer robustness · approximate softmax

The pith

An explicit optimal attention temperature minimizes ICL generalization error under distribution shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that attention temperature, adjusted at inference time, can minimize the error of in-context learning when test inputs come from a different distribution than the pretraining data. The analysis uses a tractable high-dimensional linear regression model with an approximate softmax that keeps temperature's effect on selectivity. If correct, this gives a direct formula for the best temperature based on moments of the pre-softmax scores and indicates when the adjustment can reach near-optimal performance without retraining. The result is checked in simulations and on actual models including GPT-2 and Llama2-7B facing noisy demonstrations.

Core claim

In the high-dimensional linear-regression framework with approximate softmax attention, the ICL generalization error under distribution shift has a closed-form expression that is minimized by an explicit optimal attention temperature derived from moments of the pre-softmax attention scores; this choice recovers near Bayes-optimal performance in suitable regimes.

What carries the argument

The optimal attention temperature, obtained by minimizing the closed-form ICL generalization error and expressed in terms of moments of the pre-softmax attention scores.

If this is right

The generalization error reaches a minimum at the temperature tied to the first and second moments of the pre-softmax scores.
Temperature tuning can restore performance close to the Bayes-optimal estimator when the shift satisfies the derived conditions.
The same temperature adjustment improves robustness on pretrained models when in-context examples contain noise.
The closed-form error expression supplies concrete guidance for selecting the temperature from observable attention statistics.
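
To make the role of temperature concrete, here is a minimal sketch of temperature-scaled softmax attention weights. It assumes the common convention softmax(s / τ), which may differ in detail from the paper's parameterization, and the scores are illustrative values rather than data from the paper.

```python
import numpy as np

def softmax_with_temperature(scores, tau):
    """Softmax over the last axis of scores / tau (tau > 0)."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

def entropy(weights):
    """Shannon entropy in nats; higher entropy means less selective attention."""
    w = np.clip(weights, 1e-12, 1.0)
    return float(-(w * np.log(w)).sum())

# Illustrative pre-softmax scores (assumed values, not taken from the paper).
scores = np.array([2.0, 1.0, 0.5, -1.0])
for tau in (0.25, 1.0, 4.0):
    w = softmax_with_temperature(scores, tau)
    print(f"tau={tau}: weights={np.round(w, 3)}, entropy={entropy(w):.3f}")
```

Low temperatures concentrate the weights on the highest-scoring entry, while high temperatures push them toward uniform. This is the selectivity that the temperature adjustment trades off.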

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same temperature selection rule could be tested on attention variants that depart from the approximate softmax used here.
A validation procedure that estimates the required moments from a small held-out shifted set might make the method directly usable in practice.
The framework suggests examining whether temperature interacts with other inference-time controls such as demonstration ordering or length.
Extending the linear model to include mild nonlinearity would test how far the optimal-temperature prediction continues to hold.

Load-bearing premise

The analysis relies on a high-dimensional linear regression setup together with an approximate softmax attention that keeps normalization and temperature selectivity but stays mathematically tractable.
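
One way to picture the gap between exact and approximate attention is sketched below. The surrogate here is a first-order linearization, exp(s/τ) ≈ 1 + s/τ followed by normalization. That form is an assumption made for illustration, since this page does not give the paper's exact definition of its approximate softmax.

```python
import numpy as np

def exact_softmax(scores, tau):
    """Standard softmax of scores / tau."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def linearized_softmax(scores, tau):
    """Assumed surrogate: replace exp(s/tau) by 1 + s/tau, then normalize.
    Weights still sum to 1 but can go negative when s/tau < -1."""
    u = 1.0 + np.asarray(scores, dtype=float) / tau
    return u / u.sum()

def max_gap(scores, tau):
    """Largest absolute difference between exact and linearized weights."""
    diff = exact_softmax(scores, tau) - linearized_softmax(scores, tau)
    return float(np.abs(diff).max())

# Illustrative small-magnitude scores (assumed values).
scores = np.array([0.5, -0.3, 0.1, 0.2])
for tau in (1.0, 3.0, 10.0):
    print(f"tau={tau}: max |exact - linearized| = {max_gap(scores, tau):.5f}")
```

In this toy setting the two weightings agree more closely as the temperature grows relative to the score magnitudes, which is the regime where a linearization of this kind would be expected to be most faithful.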

What would settle it

A direct check in the linear regression simulations of whether the predicted optimal temperature produces the lowest measured ICL generalization error under the modeled distribution shift.
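
A check of this kind could be sketched as a temperature sweep in a toy simulation. The estimator below is a plain softmax-weighted average of context labels with no learned weights. It is a stand-in chosen for brevity, not the paper's pretrained Transformer, and the dimensions, noise level, and temperature grid are all assumed values.

```python
import numpy as np

def attention_predict(X, y, x_query, tau):
    """Predict the query label as a softmax-weighted average of context labels."""
    s = X @ x_query / np.sqrt(X.shape[1])  # pre-softmax scores
    z = s / tau
    z = z - z.max()
    w = np.exp(z)
    w /= w.sum()
    return w @ y

def icl_error(tau, input_scale, rng, d=8, l=64, trials=300, noise=0.1):
    """Monte Carlo estimate of squared prediction error at temperature tau."""
    errs = []
    for _ in range(trials):
        w_task = rng.standard_normal(d) / np.sqrt(d)      # random regression task
        X = input_scale * rng.standard_normal((l, d))     # context inputs
        y = X @ w_task + noise * rng.standard_normal(l)   # noisy context labels
        x_q = input_scale * rng.standard_normal(d)        # query input
        errs.append((attention_predict(X, y, x_q, tau) - x_q @ w_task) ** 2)
    return float(np.mean(errs))

taus = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
for scale in (1.0, np.sqrt(3.0)):  # scale sqrt(3) gives input covariance 3I
    # Re-seeding per temperature reuses the same random draws across the grid.
    errors = [icl_error(t, scale, np.random.default_rng(0)) for t in taus]
    best = taus[int(np.argmin(errors))]
    print(f"input scale {scale:.2f}: best tau on grid = {best}")
```

Comparing the grid minimizer with a temperature predicted from theory, at both the unshifted and shifted input scales, is the shape of the test described above. The toy estimator only illustrates the procedure.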

Figures

Figures reproduced from arXiv: 2511.01292 by Samet Demir, Zafer Dogan.

**Figure 2.** Effect of noise shift on Transformer (5). The pretraining noise is ...

**Figure 3.** Effect of attention temperature on the ICL performance of LLaMA-2-7B (Touvron et al., ...

**Figure 4.** Comparison of linear and linearized attention under a shift in input mean. The plot il...

**Figure 5.** Comparison of temperature effects of softmax, linearized softmax, and linear (with ...

**Figure 6.** GPT-2 (Radford et al., 2019) under an input-covariance shift. GPT-2 exemplifies the ...

Original abstract

Pretrained Transformers can perform in-context learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining, a common deployment setting. We study attention temperature as a simple inference-time control for improving ICL robustness under such shifts. In a high-dimensional linear-regression framework, we analyze a Transformer with "approximate softmax" attention, which preserves softmax's normalization and temperature-dependent selectivity while remaining tractable. We derive a closed-form expression for the ICL generalization error under distribution shift, and show that it is minimized by an explicit optimal attention temperature. This characterization yields interpretable guidance by linking the best temperature to moments of the pre-softmax attention scores, and predicts when temperature adjustment can recover near Bayes-optimal performance. We validate the theory with extensive simulations, and further demonstrate gains on pretrained LLMs (GPT-2 and Llama2-7B) on question-answering benchmarks under distribution shift induced by noisy in-context demonstrations. Overall, attention temperature emerges as a principled, lightweight knob for improving the robustness of ICL in pretrained Transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper derives a closed-form optimal attention temperature tied to pre-softmax moments that minimizes ICL error under shift in a linear model, but the approximate softmax creates a gap with real attention that the experiments do not fully close.

The main thing to know is that this work gives an explicit formula for the best attention temperature in a high-dimensional linear regression version of ICL, expressed directly in terms of moments of the pre-softmax scores, and shows it can push performance close to Bayes optimal under distribution shift. That link is new relative to the usual empirical temperature sweeps in the ICL literature they cite. They also run simulations that track the theory and test the idea on GPT-2 and Llama2-7B with noisy in-context examples, where temperature adjustment yields measurable gains on QA benchmarks. That combination of derivation plus real-model checks is the part that stands out as useful.

The derivation itself looks clean, and the optimal temperature is obtained by minimizing the derived error expression rather than by fitting to outcomes, which avoids obvious circularity. The linear setup lets them keep the math tractable while still capturing high-dimensional effects and distribution shift.

The soft spot is the approximate softmax they introduce for tractability. It keeps normalization and temperature selectivity but swaps out the true exponential, so the moments that determine the optimal temperature could be distorted relative to standard attention. The LLM experiments use the exact mechanism, yet the theory and the claimed recovery of near-optimal performance rest on the surrogate; the paper does not appear to measure how large that distortion is on the error surface. The high-dimensional linear regression framework is a reasonable modeling choice for analysis, but it leaves the usual questions about how much carries over once you add the full nonlinearities, multi-head structure, and pretraining dynamics of actual Transformers.

This paper is for readers who want a principled, inference-time knob for ICL robustness rather than retraining. Someone working on deployment under shift or on theoretical characterizations of attention would find the closed-form guidance and the moment-based interpretation worth their time. It has enough formal grounding and external checks to merit sending out for serious refereeing, even if the approximation needs tighter validation in revision.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that in a high-dimensional linear-regression framework with an approximate softmax attention mechanism, the ICL generalization error under distribution shift admits a closed-form expression that is minimized by an explicit optimal attention temperature expressed in terms of moments of the pre-softmax attention scores. This optimum is predicted to recover near Bayes-optimal performance under suitable conditions. The theory is validated through simulations and demonstrated on pretrained LLMs (GPT-2 and Llama2-7B) for question-answering tasks when in-context demonstrations are corrupted by noise.

Significance. If the central claims hold, the work supplies a simple, theoretically motivated inference-time knob for improving ICL robustness under distribution shift, together with interpretable guidance based on observable attention-score moments. The closed-form derivation, its direct minimization, and the combination of synthetic validation with experiments on real LLMs constitute clear strengths. The result addresses a practically relevant failure mode of pretrained Transformers without requiring retraining or architectural changes.

major comments (1)

[Abstract and derivation of ICL error] Abstract and the section deriving the ICL error: the closed-form generalization error and its minimizing temperature are obtained under an approximate softmax that replaces the true exponential while preserving normalization and selectivity. Because the reported optimum is an explicit function of moments of the pre-softmax scores, any systematic distortion of those moments or of the error surface by the approximation directly affects whether the derived temperature remains optimal for the exact softmax used in the GPT-2 and Llama2-7B experiments. The manuscript does not quantify the modeling gap or show that the location of the minimum is preserved, which is load-bearing for the claim that temperature adjustment recovers near Bayes-optimal performance.

minor comments (1)

The connection between the theoretical moments and the quantities that can be estimated from a single forward pass on the test prompt could be stated more explicitly to make the practical guidance easier to implement.
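
As a rough illustration of what estimating those quantities could involve, the sketch below computes empirical first and second moments of pre-softmax scores from query and key matrices. The matrices `Q` and `K`, the 1/sqrt(d) scaling, and the pooling over all query-key pairs are assumptions for illustration. How these moments enter the paper's optimal-temperature formula is not specified on this page.

```python
import numpy as np

def score_moments(Q, K):
    """Empirical first and second moments of scores q_i . k_j / sqrt(d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return float(scores.mean()), float((scores ** 2).mean())

# Hypothetical queries and keys standing in for one attention head's activations.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 16))
K = rng.standard_normal((32, 16))

m1, m2 = score_moments(Q, K)
print(f"first moment={m1:.3f}, second moment={m2:.3f}")

# Scaling the keys by c scales the first moment by c and the second by c**2,
# so a covariance-type shift in the inputs shows up directly in these statistics.
m1_s, m2_s = score_moments(Q, 2.0 * K)
print(f"after scaling keys by 2: first={m1_s:.3f}, second={m2_s:.3f}")
```

In a pretrained model the same statistics would be read from the attention layer's query-key products during a forward pass on the test prompt.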

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed comments. We address the single major comment below and describe the revisions that will be incorporated to strengthen the connection between the theoretical derivation and the LLM experiments.

Referee: Abstract and the section deriving the ICL error: the closed-form generalization error and its minimizing temperature are obtained under an approximate softmax that replaces the true exponential while preserving normalization and selectivity. Because the reported optimum is an explicit function of moments of the pre-softmax scores, any systematic distortion of those moments or of the error surface by the approximation directly affects whether the derived temperature remains optimal for the exact softmax used in the GPT-2 and Llama2-7B experiments. The manuscript does not quantify the modeling gap or show that the location of the minimum is preserved, which is load-bearing for the claim that temperature adjustment recovers near Bayes-optimal performance.

Authors: We agree that explicitly quantifying the modeling gap introduced by the approximate softmax is necessary to support the claim that the derived optimal temperature remains relevant for the exact softmax used in the GPT-2 and Llama2-7B experiments. The approximate softmax was introduced precisely to obtain a tractable closed-form expression while retaining normalization and the temperature-dependent selectivity of the true mechanism. In the revised manuscript we will add a dedicated subsection (and corresponding appendix) that (i) derives high-dimensional bounds on the difference between the approximate and exact attention weights, (ii) shows that the location of the minimum of the ICL error surface is preserved up to o(1) terms under the same moment conditions used in the main derivation, and (iii) reports additional simulations that directly compare the ICL error curves and the argmin temperature under both the approximate and exact softmax for parameter regimes matching the LLM experiments. These additions will make the link between theory and the observed gains on noisy in-context demonstrations explicit.

Revision: yes

Circularity Check

0 steps flagged

No circularity: closed-form derivation under explicit approximation

The paper explicitly adopts an approximate softmax attention mechanism as a modeling choice to obtain tractability while preserving normalization and selectivity. It then derives a closed-form ICL generalization error expression under distribution shift in the high-dimensional linear regression setting and minimizes that expression to obtain the optimal temperature, which is characterized in terms of moments of the pre-softmax scores. This is a direct analytic minimization rather than a self-definitional loop, a fitted parameter renamed as prediction, or any load-bearing self-citation. The approximation is stated upfront and the resulting guidance is internal to the derived error surface, rendering the central claim self-contained within the paper's stated framework and assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the high-dimensional linear regression modeling choice and the tractable approximate softmax; these are standard domain assumptions rather than new postulates.

axioms (2)

domain assumption: A high-dimensional linear-regression framework models the Transformer's ICL behavior. It enables the closed-form generalization error derivation under distribution shift.
domain assumption: Approximate softmax attention preserves normalization and temperature-dependent selectivity. It allows analytic tractability while approximating real attention mechanisms.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (relation between the paper passage and the cited Recognition theorem: unclear)

We employ a linearized approximation of softmax attention ... preserves the essential temperature-dependent behavior ... derive closed-form generalization error ... optimal attention temperature $\tau_{\text{optimal}} = 2\,\mathrm{Tr}(A M_{11}^{T} F_1 M_{11}) / \mathrm{Tr}(A (F_2 M_{11} + M_{11}^{T} F_2^{T}))$
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration (relation between the paper passage and the cited Recognition theorem: unclear)

linearized softmax attention ... tractable theoretical analysis
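
The trace-ratio expression quoted in the first passage above can be evaluated numerically once its matrices are known. The sketch below reads it as 2 Tr(A M11^T F1 M11) / Tr(A (F2 M11 + M11^T F2^T)), taking each "T" in the extracted text as a transpose. The matrices `A`, `M11`, `F1`, and `F2` are defined in the paper rather than on this page, so the function treats them as given square arrays and the identity inputs are placeholders.

```python
import numpy as np

def optimal_temperature(A, M11, F1, F2):
    """Evaluate 2 Tr(A M11^T F1 M11) / Tr(A (F2 M11 + M11^T F2^T)).

    The arguments are placeholders for the paper's matrices; their
    definitions are not reproduced here.
    """
    num = 2.0 * np.trace(A @ M11.T @ F1 @ M11)
    den = np.trace(A @ (F2 @ M11 + M11.T @ F2.T))
    if np.isclose(den, 0.0):
        raise ValueError("denominator trace is numerically zero")
    return float(num / den)

# Placeholder inputs: with all matrices equal to the 3x3 identity,
# the numerator is 2 * Tr(I) = 6 and the denominator is Tr(2I) = 6.
eye = np.eye(3)
print(optimal_temperature(eye, eye, eye, eye))  # 1.0
```

The guard on the denominator is a design choice for the sketch. Whether the denominator can vanish under the paper's assumptions is not stated here.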

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

[2] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, and Gao Huang. Bridging the divide: Reconsidering softmax and linear attention. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[3] ISSN 00063444, 14643510. Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song. Vision transformer for small-size datasets. arXiv preprint arXiv:2112.13492.

[4] Junyang Lin, Xu Sun, Xuancheng Ren, Muyu Li, and Qi Su. Learning when to concentrate or divert attention: Self-adaptive attention temperature for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2985–2990.

[5] Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114.

[6] https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Core Francisco Park, Ekdeep Singh Lubana, Itamar Pres, and Hidenori Tanaka. Competition dynamics shape algorithmic phases of in-context learning. arXiv preprint arXiv:2412.01003.

[7] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[8] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art...

[9] Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? In The Twelfth International Conference on Learning Representations.

[10] Zhenyu Wu, YaoXiang Wang, Jiacheng Ye, Jiangtao Feng, Jingjing Xu, Yu Qiao, and Zhiyong Wu. OpenICL: An open-source framework for in-context learning. arXiv preprint arXiv:2303.02913.
[11] i . (19) To determine the form of the posterior distribution, we complete the square in the exponent by collecting all terms involving $w$. Expanding the exponent in the joint expression from above, we obtain: $-\frac{1}{2\sigma^2}\left(\bar{y}^T \bar{y} - 2\bar{y}^T \bar{X} w + w^T \bar{X}^T \bar{X} w\right) - \frac{1}{2}\left(w^T \Sigma_0^{-1} w - 2\mu_0^T \Sigma_0^{-1} w + \mu_0^T \Sigma_0^{-1} \mu_0\right)$. (20) Grouping the quadratic and linear terms in $w$, we arrive at: −1...
[12] in Appendix H. Note that our inputs $\bar{x}_i$ are centered, i.e., $\bar{x}_i = x_i - \frac{1}{l}\sum_{i \le l} x_i$, so their distribution is $N(0, \Sigma_x)$ as $l \to \infty$. Therefore, Lemma G.1 is directly applicable in our setting. Next, we start the calculations of the expectations in (78) with $E[ee^T]$ as follows: $E[ee^T] = \frac{1}{\tau^2} M_{11}^T E\big[\tfrac{1}{l}\sum_{i \le l-1} \bar{x}_i \bar{x}_i^T v_{21} + v_{22}\,\tfrac{1}{l}\sum_{i \le l-1} \bar{x}_i \epsilon_i$ ...
[13] under a shift in input covariance (Figure 6). Consistent with prior work (Garg et al., 2022; Zhang et al., 2024), such shifts substantially degrade performance and can even induce non-monotonic generalization error with respect to context length $l$. Remarkably, applying the optimal temperature mitigates this nonmonotonicity and improves in-context generali...

[14] under an input-covariance shift. GPT-2 exemplifies the Transformer architecture (Vaswani et al., 2017), combining multi-layer perceptrons with multi-head softmax self-attention. The model here is pretrained by Garg et al. (2022) on the linear regression tasks defined in (2). We consider a shift from $\Sigma_x^{\text{train}} = I$ to $\Sigma_x^{\text{test}} = 3I$. The attention temperature at...

[15] as implemented in HuggingFace (Wolf et al., 2020), leveraging the pretrained model of Garg et al. (2022). Training data match ours, while their training procedure differs slightly: the loss is auto-regressive, i.e., the average over the entire context sequence of length $l$ =

[16] We adopt the same embedding method as in Garg et al. (2022). The input dimension is $d = 20$, with 12 layers and 8 heads. All GPT-2 experiments run on an NVIDIA Tesla V100 GPU and complete in approximately 10 minutes. K.3 DETAILS OF THE LLM EXPERIMENTS IN FIGURE 3. For our large language model experiments, we employ LLaMA2-7B (Touvron et al.,

[17] and the SCIQ dataset (Welbl et al., 2017), which contains science questions with supporting information. We generate ICL problems following Gao et al. (2024), selecting in-context demonstrations using the TopK retrieval technique (Liu et al.,

[18] To simulate distribution shift, we follow Gao et al. (2024) and introduce noisy labels (incorrect but semantically related) to the in-context demonstrations (Appendix K.4). Table 2 gives an example. The noisy ratio denotes the fraction of demonstrations with noisy labels (e.g., 0.6 means 60% noisy). We modify and use the codebase of Gao et al. (2024), bui...

[19] and OpenICL (Wu et al., 2023). All LLM experiments run on an NVIDIA A40 GPU; a single Monte Carlo run per plot in Figure 3 takes a few hours. K.4 WHY IN-CONTEXT DEMONSTRATIONS WITH NOISY LABELS AS AN EXAMPLE OF DISTRIBUTION SHIFT? The link between noisy labels in demonstrations and distribution shift may not be immediately obvious. Quantifying pretraini...
