
arxiv: 2605.20730 · v1 · pith:UKKCQNEXnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

Pith reviewed 2026-05-21 05:28 UTC · model grok-4.3

keywords in-context learning · task vectors · distributional alignment · next-token prediction · large language models · linear regression · model transfer

The pith

Task vectors improve when their next-token predictions are forced to match those of full in-context learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that good task vectors are those whose output distributions line up with the distributions produced by ordinary in-context learning. It introduces d_NTP to measure how much the next-token probabilities diverge between the two approaches and shows that this divergence tracks downstream accuracy across multiple models and tasks. Motivated by the correlation, the authors derive a closed-form linear method called LTV that directly minimizes the divergence by regressing the effect of demonstrations. The resulting vectors raise average accuracy by 9.2 percent while cutting latency, and the same alignment idea transfers to regression tasks and across model sizes.

Core claim

We posit that task-vector inference should produce next-token probability distributions that align with those of standard in-context learning. We quantify the misalignment with the metric d_NTP and observe a strong negative correlation between d_NTP and task accuracy. We therefore construct Linear Task Vector by solving a closed-form linear regression that estimates demonstration effects so as to minimize d_NTP. This construction yields task vectors that outperform prior extraction methods on eight classification benchmarks and five language models, with a 9.2 percent average accuracy gain and lower inference cost.
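To make the metric concrete, here is a minimal sketch of how a d_NTP-style score could be computed. The paper's exact definition (its equation 9) is not reproduced on this page, so the use of KL divergence, its direction, and the averaging over test queries are all assumptions, as are the helper names `softmax`, `kl_divergence`, and `d_ntp`.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def d_ntp(icl_logits, tv_logits):
    """Assumed form of the discrepancy: mean KL(P_icl || P_tv) over queries.

    icl_logits[i] and tv_logits[i] are the next-token logits for query i
    under in-context inference and task-vector inference respectively.
    """
    if len(icl_logits) != len(tv_logits) or not icl_logits:
        raise ValueError("need the same non-zero number of queries in both modes")
    divergences = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(icl_logits, tv_logits)
    ]
    return sum(divergences) / len(divergences)
```

Under this reading, a score of zero means the task vector reproduces the ICL next-token distribution exactly, and larger values mean greater misalignment.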

What carries the argument

Linear Task Vector (LTV), a closed-form linear mapping obtained by regressing demonstration effects to minimize the next-token probability discrepancy d_NTP between task-vector and in-context inference.
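The closed-form fit can be sketched as an ordinary ridge regression, following the Figure 4 caption: a linear map W predicts the demonstration effect (h_icl − h_zs) from the zero-shot hidden state h_zs. This is an illustrative sketch, not the authors' implementation. The function names `fit_ltv` and `estimate_shift`, the regularization strength `lam`, and the column-per-query layout are assumptions; which layer the hidden states come from and how the estimate is injected are not specified on this page.

```python
import numpy as np


def fit_ltv(h_zs, h_icl, lam=1.0):
    """Fit W so that W @ h_zs approximates (h_icl - h_zs) in the ridge sense.

    h_zs, h_icl: arrays of shape (d, N), one column per training query.
    Returns W of shape (d, d).
    """
    X = h_zs                # regression inputs
    Y = h_icl - h_zs        # regression targets: the demonstration effect
    d = X.shape[0]
    gram = X @ X.T + lam * np.eye(d)
    # Closed form: W = Y X^T (X X^T + lam I)^{-1}.
    # gram is symmetric, so solve gram @ W^T = X @ Y^T and transpose.
    return np.linalg.solve(gram, X @ Y.T).T


def estimate_shift(W, h_query):
    """Estimated demonstration effect for one zero-shot hidden state."""
    return W @ h_query
```

Solving the linear system rather than forming an explicit inverse is a numerical-stability choice. A larger `lam` shrinks W toward zero, which trades fit quality for robustness when N is small relative to the hidden dimension.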

If this is right

  • LTV raises average accuracy 9.2 percent over existing task-vector baselines on eight classification benchmarks.
  • The same vectors reduce inference latency relative to full in-context learning.
  • LTV also outperforms baselines on regression tasks.
  • Task vectors extracted from a larger model improve a smaller model's performance by 6.4 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If minimizing d_NTP generalizes, the same regression objective could be applied to other compression techniques such as attention-based or nonlinear task vectors.
  • The alignment view suggests that task vectors might be further improved by matching higher-order statistics beyond single next-token probabilities.
  • Cross-scale transfer results imply that task vectors could serve as portable adapters between model families of different sizes.

Load-bearing premise

The negative correlation between d_NTP and downstream accuracy will continue to hold for new tasks, models, and prompt formats.

What would settle it

Run LTV on a fresh suite of classification or regression tasks where the measured d_NTP-accuracy correlation is weak or reversed; if accuracy gains disappear, the design criterion fails.
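The check described above amounts to recomputing the correlation on new tasks. A minimal sketch follows, assuming Pearson correlation is the statistic of interest (this page does not say which correlation coefficient the paper reports). The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Passing one d_NTP value and one accuracy value per (method, task) pair yields a coefficient near -1 if the criterion holds on the new suite, and a coefficient near zero or positive if it does not.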

Figures

Figures reproduced from arXiv: 2605.20730 by Jihoon Kwon, Jiwon Choi, Jy-yong Sohn.

Figure 1: Comparison of three inference modes. In zero-shot inference mode (left), the model predicts the next token $\hat{y}_{\text{zs}}$ solely based on the test query $x_{\text{test}}$. In the in-context learning mode (middle), the model predicts the next token $\hat{y}_{\text{icl}}$ based on the concatenation of demonstrations $Z$ and the query $x_{\text{test}}$. In the task vector mode (right), the model predicts the next token $\hat{y}_{\text{tv}}$ based on not only the query $x_{\text{test}}$, bu…
Figure 2: Overview of the proposed metric $d_{\text{NTP}}(f; Z)$ in equation 9, which measures the quality of the task vector extraction method $f$. In the ICL mode, the model gets demonstrations $Z$ and test query $x_{\text{test}}$ together to estimate the probability distribution $P_{\text{icl}}$ for the next token (left). In the TV mode, the task vector $v$ is injected in the hidden layer (instead of putting demonstrations $Z$ in the input layer) to get the…
Figure 3: Correlation between the proposed discrepancy metric …
Figure 4: Overview of our Linear Task Vector (LTV) method. Our method employs a linear mapping $W$ that estimates the effect of demonstrations in the hidden space $(h_{\text{icl}} - h_{\text{zs}})$ from the hidden state $h_{\text{zs}}$ of the zero-shot inference mode via ridge regression. In the extraction phase (left), we use $N$ unlabeled training queries $\{x_j\}_{j=1}^{N}$ to define (1) the regression target matrix $Y$ as the concatenation of $N$ column vectors …
Figure 5: Comparison of $d_{\text{NTP}}$ across LTV and four baselines on eight benchmarks, tested on LLaMA-3.1-8B. LTV consistently achieves the lowest $d_{\text{NTP}}$ across all benchmarks.
Figure 6: Extraction time cost versus downstream accuracy of …
Figure 7: Extraction time cost versus downstream accuracy of …
Original abstract

In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales, an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4%, suggesting a new utility for extracted task representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper posits that task vectors should align the next-token predictive distribution with standard in-context learning (ICL), quantified via a new metric d_NTP measuring discrepancy in next-token probabilities. It reports a strong negative correlation between d_NTP and downstream accuracy, then introduces Linear Task Vector (LTV) as a closed-form linear regression to minimize d_NTP by estimating demonstration effects. Across eight classification benchmarks and five LLMs, LTV yields a 9.2% average accuracy gain over baselines with lower latency; additional results cover regression tasks and 6.4% gains from transferring task vectors from larger to smaller models.

Significance. If the distributional alignment criterion and the observed d_NTP-accuracy correlation prove robust, LTV supplies a principled, efficient design method for task vectors that goes beyond post-hoc accuracy tuning. The closed-form regression, latency reduction, and cross-scale transfer results would be useful contributions to ICL research. The central claims, however, rest on empirical patterns observed within the same set of benchmarks used both for correlation analysis and method evaluation.

major comments (3)
  1. The reported 9.2% average accuracy improvement (Abstract) is presented without statistical significance tests, error bars, or analysis of sensitivity to prompt variations or random seeds. This weakens confidence that the gains are stable rather than tied to particular experimental choices on the eight benchmarks.
  2. The construction of LTV relies on a linear mapping fitted to minimize d_NTP, yet the manuscript does not clarify whether this regression uses held-out tasks or the same eight classification benchmarks on which the d_NTP-accuracy correlation was measured. If the latter, the performance lift may be an artifact of fitting to the evaluation distribution rather than evidence that minimizing d_NTP reliably improves vectors on new tasks.
  3. The negative correlation between d_NTP and accuracy is used to justify LTV, but no out-of-distribution experiments test whether this correlation (and thus the benefit of the regression) persists for unseen task distributions, prompt formats, or model scales beyond the reported transfer result.
minor comments (2)
  1. The abstract states that LTV also outperforms baselines on regression tasks but supplies no quantitative numbers, specific datasets, or comparison tables; adding these details would improve completeness.
  2. Notation for d_NTP and the linear mapping matrix should be introduced with an explicit equation in the main text to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes we will make to the manuscript.

  1. Referee: The reported 9.2% average accuracy improvement (Abstract) is presented without statistical significance tests, error bars, or analysis of sensitivity to prompt variations or random seeds. This weakens confidence that the gains are stable rather than tied to particular experimental choices on the eight benchmarks.

    Authors: We agree that the lack of statistical tests and error bars reduces confidence in the stability of the reported gains. In the revised manuscript we will add error bars computed over multiple random seeds, report standard deviations, include paired statistical significance tests on the accuracy improvements, and provide an analysis of sensitivity to prompt template variations and seed choices in the experimental section. revision: yes

  2. Referee: The construction of LTV relies on a linear mapping fitted to minimize d_NTP, yet the manuscript does not clarify whether this regression uses held-out tasks or the same eight classification benchmarks on which the d_NTP-accuracy correlation was measured. If the latter, the performance lift may be an artifact of fitting to the evaluation distribution rather than evidence that minimizing d_NTP reliably improves vectors on new tasks.

    Authors: The linear regression for LTV is performed independently for each task using only the in-context demonstrations of that task; the d_NTP-accuracy correlation was obtained by evaluating multiple task-vector methods across the same set of benchmarks. Because test accuracy is measured on held-out test examples separate from the demonstrations, the procedure does not fit to the evaluation distribution. We will revise the methods section to make this per-task, demonstration-only fitting explicit and to clarify the separation between demonstration data and test evaluation. revision: yes

  3. Referee: The negative correlation between d_NTP and accuracy is used to justify LTV, but no out-of-distribution experiments test whether this correlation (and thus the benefit of the regression) persists for unseen task distributions, prompt formats, or model scales beyond the reported transfer result.

    Authors: We acknowledge that broader out-of-distribution validation would strengthen the claim that minimizing d_NTP is a reliable design principle. Our existing transfer results across model scales already provide some evidence of generalization. In the revision we will add experiments on additional unseen tasks and prompt-format variations to test whether the d_NTP-accuracy correlation and the advantage of LTV hold under these shifts. revision: yes
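The paired significance test promised in the first response could take several forms. One option is a paired sign-flip permutation test over per-seed accuracies, sketched below. Neither the referee nor the authors name a specific test, so this choice is an assumption, as are the function name `paired_permutation_test` and its parameters.

```python
import random
import statistics


def paired_permutation_test(ltv_acc, baseline_acc, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-seed accuracies.

    ltv_acc[i] and baseline_acc[i] must come from the same seed and task.
    Returns (mean difference, estimated p-value).
    """
    if len(ltv_acc) != len(baseline_acc) or not ltv_acc:
        raise ValueError("need paired, non-empty accuracy lists")
    diffs = [a - b for a, b in zip(ltv_acc, baseline_acc)]
    observed = statistics.fmean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value
```

A permutation test avoids the normality assumption of a paired t-test, which matters when only a handful of seeds are available. With very few seeds the smallest attainable p-value is bounded below by roughly 2 / 2^n for n pairs.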

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper defines d_NTP independently as the discrepancy in next-token probabilities between task-vector and ICL inference. It reports an empirical negative correlation between d_NTP and downstream accuracy on the eight benchmarks. LTV is then derived as a closed-form linear regressor whose explicit objective is to minimize d_NTP; accuracy is never an input to the regression. Because the fitted quantity (d_NTP) is distinct from the evaluation metric (accuracy) and the construction does not rename a fit to accuracy as a prediction of accuracy, no step reduces to its own inputs by definition. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. The central claim therefore rests on external empirical validation rather than circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claims rest on the empirical correlation between d_NTP and accuracy plus the assumption that a linear map fitted to demonstration effects will generalize; no new physical or mathematical axioms are introduced.

free parameters (1)
  • linear mapping matrix
    Coefficients of the closed-form linear map are estimated by regression on demonstration effects for each task.


