Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning
Pith reviewed 2026-05-21 05:28 UTC · model grok-4.3
The pith
Task vectors improve when their next-token predictions are forced to match those of full in-context learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We posit that task-vector inference should produce next-token probability distributions that align with those of standard in-context learning. We quantify the misalignment with the metric d_NTP and observe a strong negative correlation between d_NTP and task accuracy. We therefore construct Linear Task Vector by solving a closed-form linear regression that estimates demonstration effects so as to minimize d_NTP. This construction yields task vectors that outperform prior extraction methods on eight classification benchmarks and five language models, with a 9.2 percent average accuracy gain and lower inference cost.
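The text above does not state the formula for d_NTP. As a rough illustration only, the sketch below assumes the discrepancy is a KL divergence between the in-context next-token distribution and the task-vector next-token distribution, averaged over queries. The function names (`softmax`, `kl_divergence`, `d_ntp`) and the toy logits are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) over a shared vocabulary; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def d_ntp(icl_logits, tv_logits) -> float:
    """Assumed form of d_NTP: mean KL(ICL || task-vector) across queries."""
    per_query = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(icl_logits, tv_logits)
    ]
    return sum(per_query) / len(per_query)


# Toy logits over a 3-token vocabulary for two queries (made-up numbers).
icl = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
tv_close = [[1.9, 0.6, -1.0], [0.2, 1.4, 0.3]]
tv_far = [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]]

print(d_ntp(icl, tv_close))  # small: the two inference modes nearly agree
print(d_ntp(icl, tv_far))    # larger: the task vector predicts different tokens
```

Any other divergence between distributions (total variation, Jensen-Shannon) could be swapped in for `kl_divergence` without changing the structure of the computation.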
What carries the argument
Linear Task Vector (LTV), a closed-form linear mapping obtained by regressing demonstration effects to minimize the next-token probability discrepancy d_NTP between task-vector and in-context inference.
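The summary describes LTV only as a closed-form linear mapping obtained by regression. To show what a closed-form fit of this kind looks like, here is a minimal sketch that assumes ridge-regularized least squares from hypothetical input features `X` to hypothetical "demonstration effect" targets `Y`. The shapes, the regularizer `lam`, and the synthetic data are assumptions for illustration and may differ from the paper's actual construction.

```python
import numpy as np


def fit_linear_map(X: np.ndarray, Y: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Closed-form ridge solution W = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    gram = X.T @ X + lam * np.eye(d)
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(gram, X.T @ Y)


rng = np.random.default_rng(0)
n, d_in, d_out = 200, 8, 4

# Synthetic stand-ins (both hypothetical): X plays the role of hidden states
# computed without demonstrations, Y the shift that demonstrations would induce.
X = rng.normal(size=(n, d_in))
W_true = rng.normal(size=(d_in, d_out))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_out))

W_hat = fit_linear_map(X, Y)
rel_error = np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true)
print(f"relative error of recovered map: {rel_error:.4f}")
```

Because the solution is a single linear solve, no gradient-based training loop is needed, which is consistent with the summary's emphasis on a closed-form construction.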
If this is right
- LTV raises average accuracy by 9.2 percent over existing task-vector baselines on eight classification benchmarks.
- The same vectors reduce inference latency relative to full in-context learning.
- LTV also outperforms baselines on regression tasks.
- Task vectors extracted from a larger model improve a smaller model's performance by 6.4 percent.
Where Pith is reading between the lines
- If minimizing d_NTP generalizes, the same regression objective could be applied to other compression techniques such as attention-based or nonlinear task vectors.
- The alignment view suggests that task vectors might be further improved by matching higher-order statistics beyond single next-token probabilities.
- Cross-scale transfer results imply that task vectors could serve as portable adapters between model families of different sizes.
Load-bearing premise
The negative correlation between d_NTP and downstream accuracy will continue to hold for new tasks, models, and prompt formats.
What would settle it
Run LTV on a fresh suite of classification or regression tasks where the measured d_NTP-accuracy correlation is weak or reversed; if accuracy gains disappear, the design criterion fails.
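The check described above amounts to measuring the d_NTP-accuracy correlation on new tasks. A minimal sketch of that measurement follows, assuming Pearson correlation; the (d_NTP, accuracy) pairs are made-up numbers, and the `-0.5` threshold is an arbitrary illustrative cutoff, not a value from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (d_NTP, accuracy) pairs, one per task-vector method on a new task.
d_ntp_values = [0.05, 0.12, 0.30, 0.45, 0.80]
accuracies = [0.91, 0.88, 0.79, 0.70, 0.55]

r = pearson_r(d_ntp_values, accuracies)
criterion_supported = r < -0.5  # arbitrary illustrative threshold
print(f"r = {r:.3f}, criterion supported: {criterion_supported}")
```

A rank-based coefficient such as Spearman's would be a reasonable alternative when the relationship is monotonic but not linear.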
Original abstract
In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales, an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper posits that task vectors should align the next-token predictive distribution with standard in-context learning (ICL), quantified via a new metric d_NTP measuring discrepancy in next-token probabilities. It reports a strong negative correlation between d_NTP and downstream accuracy, then introduces Linear Task Vector (LTV) as a closed-form linear regression to minimize d_NTP by estimating demonstration effects. Across eight classification benchmarks and five LLMs, LTV yields a 9.2% average accuracy gain over baselines with lower latency; additional results cover regression tasks and 6.4% gains from transferring task vectors from larger to smaller models.
Significance. If the distributional alignment criterion and the observed d_NTP-accuracy correlation prove robust, LTV supplies a principled, efficient design method for task vectors that goes beyond post-hoc accuracy tuning. The closed-form regression, latency reduction, and cross-scale transfer results would be useful contributions to ICL research. The central claims, however, rest on empirical patterns observed within the same set of benchmarks used both for correlation analysis and method evaluation.
major comments (3)
- The reported 9.2% average accuracy improvement (Abstract) is presented without statistical significance tests, error bars, or analysis of sensitivity to prompt variations or random seeds. This weakens confidence that the gains are stable rather than tied to particular experimental choices on the eight benchmarks.
- The construction of LTV relies on a linear mapping fitted to minimize d_NTP, yet the manuscript does not clarify whether this regression uses held-out tasks or the same eight classification benchmarks on which the d_NTP-accuracy correlation was measured. If the latter, the performance lift may be an artifact of fitting to the evaluation distribution rather than evidence that minimizing d_NTP reliably improves vectors on new tasks.
- The negative correlation between d_NTP and accuracy is used to justify LTV, but no out-of-distribution experiments test whether this correlation (and thus the benefit of the regression) persists for unseen task distributions, prompt formats, or model scales beyond the reported transfer result.
minor comments (2)
- The abstract states that LTV also outperforms baselines on regression tasks but supplies no quantitative numbers, specific datasets, or comparison tables; adding these details would improve completeness.
- Notation for d_NTP and the linear mapping matrix should be introduced with an explicit equation in the main text to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes we will make to the manuscript.
Point-by-point responses
- Referee: The reported 9.2% average accuracy improvement (Abstract) is presented without statistical significance tests, error bars, or analysis of sensitivity to prompt variations or random seeds. This weakens confidence that the gains are stable rather than tied to particular experimental choices on the eight benchmarks.
  Authors: We agree that the lack of statistical tests and error bars reduces confidence in the stability of the reported gains. In the revised manuscript we will add error bars computed over multiple random seeds, report standard deviations, include paired statistical significance tests on the accuracy improvements, and provide an analysis of sensitivity to prompt template variations and seed choices in the experimental section.
  Revision: yes
- Referee: The construction of LTV relies on a linear mapping fitted to minimize d_NTP, yet the manuscript does not clarify whether this regression uses held-out tasks or the same eight classification benchmarks on which the d_NTP-accuracy correlation was measured. If the latter, the performance lift may be an artifact of fitting to the evaluation distribution rather than evidence that minimizing d_NTP reliably improves vectors on new tasks.
  Authors: The linear regression for LTV is performed independently for each task using only the in-context demonstrations of that task; the d_NTP-accuracy correlation was obtained by evaluating multiple task-vector methods across the same set of benchmarks. Because test accuracy is measured on held-out test examples separate from the demonstrations, the procedure does not fit to the evaluation distribution. We will revise the methods section to make this per-task, demonstration-only fitting explicit and to clarify the separation between demonstration data and test evaluation.
  Revision: yes
- Referee: The negative correlation between d_NTP and accuracy is used to justify LTV, but no out-of-distribution experiments test whether this correlation (and thus the benefit of the regression) persists for unseen task distributions, prompt formats, or model scales beyond the reported transfer result.
  Authors: We acknowledge that broader out-of-distribution validation would strengthen the claim that minimizing d_NTP is a reliable design principle. Our existing transfer results across model scales already provide some evidence of generalization. In the revision we will add experiments on additional unseen tasks and prompt-format variations to test whether the d_NTP-accuracy correlation and the advantage of LTV hold under these shifts.
  Revision: yes
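The first response promises seed-level error bars and paired significance tests without naming a specific test. One possible form is sketched here, assuming an exact sign-flip permutation test on paired per-seed accuracies. The accuracy values are made-up numbers, and the function name `paired_permutation_p` is hypothetical.

```python
import itertools
import statistics
from typing import Sequence


def paired_permutation_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to carry either sign, so every sign pattern is enumerated.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            count += 1
    return count / total


# Made-up accuracies over six seeds for a proposed method and a baseline.
method = [0.81, 0.79, 0.83, 0.80, 0.82, 0.84]
baseline = [0.74, 0.73, 0.75, 0.72, 0.76, 0.75]

gains = [m - b for m, b in zip(method, baseline)]
mean_gain = statistics.mean(gains)
std_gain = statistics.stdev(gains)
p_value = paired_permutation_p(method, baseline)
print(f"gain = {mean_gain:.3f} +/- {std_gain:.3f}, p = {p_value:.4f}")
```

With six seeds there are only 64 sign patterns, so the smallest two-sided p-value this test can report is 2/64. That floor is one reason a larger number of seeds may be needed before a gain can be called significant at stricter thresholds.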
Circularity Check
No significant circularity in the derivation chain
The paper defines d_NTP independently as the discrepancy in next-token probabilities between task-vector and ICL inference. It reports an empirical negative correlation between d_NTP and downstream accuracy on the eight benchmarks. LTV is then derived as a closed-form linear regressor whose explicit objective is to minimize d_NTP; accuracy is never an input to the regression. Because the fitted quantity (d_NTP) is distinct from the evaluation metric (accuracy) and the construction does not rename a fit to accuracy as a prediction of accuracy, no step reduces to its own inputs by definition. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. The central claim therefore rests on external empirical validation rather than circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- linear mapping matrix