Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions
Pith reviewed 2026-05-18 01:42 UTC · model grok-4.3
The pith
An explicit optimal attention temperature minimizes ICL generalization error under distribution shift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the high-dimensional linear-regression framework with approximate softmax attention, the ICL generalization error under distribution shift has a closed-form expression that is minimized by an explicit optimal attention temperature derived from moments of the pre-softmax attention scores; this choice recovers near Bayes-optimal performance in suitable regimes.
What carries the argument
The optimal attention temperature, obtained by minimizing the closed-form ICL generalization error and expressed in terms of moments of the pre-softmax attention scores.
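To make the role of temperature concrete, here is a minimal Python sketch of temperature-scaled softmax weights. It assumes the common convention softmax(s / τ), which this summary does not spell out; the function name and the example scores are illustrative and not taken from the paper.

```python
import math

def softmax_with_temperature(scores, tau):
    """Return softmax(scores / tau), computed in a numerically stable way.

    ASSUMPTION: temperature divides the pre-softmax scores (softmax(s / tau)).
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative pre-softmax scores (hypothetical values).
scores = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(scores, 0.5)  # low temperature: more selective
flat = softmax_with_temperature(scores, 5.0)   # high temperature: closer to uniform
```

Under this convention a small τ concentrates weight on the highest-scoring demonstration, while a large τ spreads weight almost evenly, which is the selectivity the optimal temperature trades off.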
If this is right
- The generalization error reaches a minimum at the temperature tied to the first and second moments of the pre-softmax scores.
- Temperature tuning can restore performance close to the Bayes-optimal estimator when the shift satisfies the derived conditions.
- The same temperature adjustment improves robustness on pretrained models when in-context examples contain noise.
- The closed-form error expression supplies concrete guidance for selecting the temperature from observable attention statistics.
Where Pith is reading between the lines
- The same temperature selection rule could be tested on attention variants that depart from the approximate softmax used here.
- A validation procedure that estimates the required moments from a small held-out shifted set might make the method directly usable in practice.
- The framework suggests examining whether temperature interacts with other inference-time controls such as demonstration ordering or length.
- Extending the linear model to include mild nonlinearity would test how far the optimal-temperature prediction continues to hold.
Load-bearing premise
The analysis relies on a high-dimensional linear regression setup together with an approximate softmax attention that keeps normalization and temperature selectivity but stays mathematically tractable.
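One way to picture such an approximation is a first-order linearization of the exponential followed by normalization. The sketch below is an assumption made for illustration only: the paper's exact "approximate softmax" is not given in this summary, and the function names and scores are hypothetical. It compares the linearized weights with exact softmax weights at two temperatures.

```python
import math

def exact_softmax(scores, tau):
    """Exact softmax(scores / tau)."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def linearized_softmax(scores, tau):
    """ASSUMPTION: exp(x) is replaced by 1 + x, then weights are normalized."""
    raw = [1.0 + s / tau for s in scores]
    z = sum(raw)
    if z == 0:
        raise ValueError("normalizer is zero for these scores and tau")
    return [r / z for r in raw]

def max_gap(scores, tau):
    """Largest absolute difference between exact and linearized weights."""
    exact = exact_softmax(scores, tau)
    approx = linearized_softmax(scores, tau)
    return max(abs(a - b) for a, b in zip(exact, approx))

scores = [0.5, -0.2, 0.1]           # hypothetical pre-softmax scores
gap_low_tau = max_gap(scores, 0.5)   # scores / tau is large: rougher approximation
gap_high_tau = max_gap(scores, 5.0)  # scores / tau is small: closer approximation
```

In this sketch the linearized weights still sum to one and still depend on τ, but they can turn negative when a scaled score falls below -1, so the match with exact softmax is best when the scaled scores are small.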
What would settle it
A direct check in the linear regression simulations of whether the predicted optimal temperature produces the lowest measured ICL generalization error under the modeled distribution shift.
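The empirical half of such a check can be sketched as a temperature sweep on synthetic linear-regression prompts. The toy below is not the paper's model: the attention readout, dimensions, noise level, input variance, and temperature grid are all assumptions chosen for illustration, and it does not evaluate the closed-form prediction, only the measured error curve against which a predicted temperature would be compared.

```python
import math
import random

def softmax(scores, tau):
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def icl_error(tau, input_var, d=8, l=40, n_tasks=200, noise=0.1, seed=0):
    """Mean squared error of a toy attention readout at temperature tau.

    ASSUMPTION: the prediction is a softmax-weighted average of context labels,
    with scores given by inner products between the query and context inputs.
    """
    rng = random.Random(seed)  # same seed for every tau: common random numbers
    std = math.sqrt(input_var)
    total = 0.0
    for _ in range(n_tasks):
        w = [rng.gauss(0, 1 / math.sqrt(d)) for _ in range(d)]
        xs = [[rng.gauss(0, std) for _ in range(d)] for _ in range(l)]
        ys = [dot(w, x) + rng.gauss(0, noise) for x in xs]
        xq = [rng.gauss(0, std) for _ in range(d)]
        weights = softmax([dot(xq, x) for x in xs], tau)
        pred = sum(a * y for a, y in zip(weights, ys))
        total += (pred - dot(w, xq)) ** 2
    return total / n_tasks

# Sweep a hypothetical temperature grid under a shifted input variance.
grid = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
errors = {tau: icl_error(tau, input_var=3.0) for tau in grid}
best_tau = min(errors, key=errors.get)
```

Reusing the same seed for every temperature keeps the comparison across the grid free of sampling noise between runs; the theory would be supported if a temperature predicted from the score moments landed at or near `best_tau` on a sufficiently fine grid.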
Original abstract
Pretrained Transformers can perform in-context learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining, a common deployment setting. We study attention temperature as a simple inference-time control for improving ICL robustness under such shifts. In a high-dimensional linear-regression framework, we analyze a Transformer with "approximate softmax" attention, which preserves softmax's normalization and temperature-dependent selectivity while remaining tractable. We derive a closed-form expression for the ICL generalization error under distribution shift, and show that it is minimized by an explicit optimal attention temperature. This characterization yields interpretable guidance by linking the best temperature to moments of the pre-softmax attention scores, and predicts when temperature adjustment can recover near Bayes-optimal performance. We validate the theory with extensive simulations, and further demonstrate gains on pretrained LLMs (GPT-2 and Llama2-7B) on question-answering benchmarks under distribution shift induced by noisy in-context demonstrations. Overall, attention temperature emerges as a principled, lightweight knob for improving the robustness of ICL in pretrained Transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that in a high-dimensional linear-regression framework with an approximate softmax attention mechanism, the ICL generalization error under distribution shift admits a closed-form expression that is minimized by an explicit optimal attention temperature expressed in terms of moments of the pre-softmax attention scores. This optimum is predicted to recover near Bayes-optimal performance under suitable conditions. The theory is validated through simulations and demonstrated on pretrained LLMs (GPT-2 and Llama2-7B) for question-answering tasks when in-context demonstrations are corrupted by noise.
Significance. If the central claims hold, the work supplies a simple, theoretically motivated inference-time knob for improving ICL robustness under distribution shift, together with interpretable guidance based on observable attention-score moments. The closed-form derivation, its direct minimization, and the combination of synthetic validation with experiments on real LLMs constitute clear strengths. The result addresses a practically relevant failure mode of pretrained Transformers without requiring retraining or architectural changes.
major comments (1)
- [Abstract and derivation of ICL error] Abstract and the section deriving the ICL error: the closed-form generalization error and its minimizing temperature are obtained under an approximate softmax that replaces the true exponential while preserving normalization and selectivity. Because the reported optimum is an explicit function of moments of the pre-softmax scores, any systematic distortion of those moments or of the error surface by the approximation directly affects whether the derived temperature remains optimal for the exact softmax used in the GPT-2 and Llama2-7B experiments. The manuscript does not quantify the modeling gap or show that the location of the minimum is preserved, which is load-bearing for the claim that temperature adjustment recovers near Bayes-optimal performance.
minor comments (1)
- The connection between the theoretical moments and the quantities that can be estimated from a single forward pass on the test prompt could be stated more explicitly to make the practical guidance easier to implement.
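As a rough illustration of what estimating these quantities might look like, the sketch below computes pre-softmax scores for one query against a set of keys and then their first and second empirical moments. The 1/sqrt(d) scaling is the standard Transformer convention and is assumed here; how the paper maps such moments to the optimal temperature is not reproduced, and all names and values are hypothetical.

```python
import math

def pre_softmax_scores(query, keys):
    """Scaled dot-product scores q . k_i / sqrt(d) for one query.

    ASSUMPTION: standard 1/sqrt(d) scaling; the paper's scaling may differ.
    """
    d = len(query)
    return [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]

def first_two_moments(scores):
    """Empirical first moment (mean) and second moment (mean of squares)."""
    n = len(scores)
    first = sum(scores) / n
    second = sum(s * s for s in scores) / n
    return first, second

# Hypothetical query and keys standing in for one attention head on one prompt.
query = [1.0, 0.0, 0.0, 0.0]
keys = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-2.0, 0.0, 0.0, 0.0],
]
scores = pre_softmax_scores(query, keys)
first, second = first_two_moments(scores)
```

In a pretrained model the same two statistics would be computed from the query and key activations captured during a forward pass, per head and per layer.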
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address the single major comment below and describe the revisions that will be incorporated to strengthen the connection between the theoretical derivation and the LLM experiments.
Referee: Abstract and the section deriving the ICL error: the closed-form generalization error and its minimizing temperature are obtained under an approximate softmax that replaces the true exponential while preserving normalization and selectivity. Because the reported optimum is an explicit function of moments of the pre-softmax scores, any systematic distortion of those moments or of the error surface by the approximation directly affects whether the derived temperature remains optimal for the exact softmax used in the GPT-2 and Llama2-7B experiments. The manuscript does not quantify the modeling gap or show that the location of the minimum is preserved, which is load-bearing for the claim that temperature adjustment recovers near Bayes-optimal performance.
Authors: We agree that explicitly quantifying the modeling gap introduced by the approximate softmax is necessary to support the claim that the derived optimal temperature remains relevant for the exact softmax used in the GPT-2 and Llama2-7B experiments. The approximate softmax was introduced precisely to obtain a tractable closed-form expression while retaining normalization and the temperature-dependent selectivity of the true mechanism. In the revised manuscript we will add a dedicated subsection (and corresponding appendix) that (i) derives high-dimensional bounds on the difference between the approximate and exact attention weights, (ii) shows that the location of the minimum of the ICL error surface is preserved up to o(1) terms under the same moment conditions used in the main derivation, and (iii) reports additional simulations that directly compare the ICL error curves and the argmin temperature under both the approximate and exact softmax for parameter regimes matching the LLM experiments. These additions will make the link between theory and the observed gains on noisy in-context demonstrations explicit.
Revision: yes
Circularity Check
No circularity: closed-form derivation under explicit approximation
The paper explicitly adopts an approximate softmax attention mechanism as a modeling choice to obtain tractability while preserving normalization and selectivity. It then derives a closed-form ICL generalization error expression under distribution shift in the high-dimensional linear regression setting and minimizes that expression to obtain the optimal temperature, which is characterized in terms of moments of the pre-softmax scores. This is a direct analytic minimization rather than a self-definitional loop, a fitted parameter renamed as prediction, or any load-bearing self-citation. The approximation is stated upfront and the resulting guidance is internal to the derived error surface, rendering the central claim self-contained within the paper's stated framework and assumptions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The high-dimensional linear-regression framework models the Transformer's ICL behavior.
- domain assumption: Approximate softmax attention preserves normalization and temperature-dependent selectivity.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We employ a linearized approximation of softmax attention ... preserves the essential temperature-dependent behavior ... derive closed-form generalization error ... optimal attention temperature $\tau_{\text{optimal}} = 2\,\mathrm{Tr}(A M_{11}^{T} F_1 M_{11}) / \mathrm{Tr}(A (F_2 M_{11} + M_{11}^{T} F_2^{T}))$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: linearized softmax attention ... tractable theoretical analysis
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.