Activation-Based Active Learning for In-Context Learning: Challenges and Insights

Geoff V. Merrett; Stuart E. Middleton; Yaseen M. Osman

arxiv: 2606.05134 · v1 · pith:D22TNB3Mnew · submitted 2026-06-03 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-28 06:25 UTC · model grok-4.3

keywords: active learning, in-context learning, MLP activations, massive activations, Spearman correlation, superposition, example selection, LLM

The pith

MLP activation statistics show at most 0.33 Spearman correlation with in-context example quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether MLP activations in transformers can supply a useful signal for choosing high-quality in-context examples via active learning. The authors examine massive activations and the first four statistical moments across Llama-3.2-3B and Qwen2.5-3B models, multiple datasets, and attention masking strategies. They report that these statistics correlate only weakly with task performance. A sympathetic reader would care because the negative result questions the viability of activation-based selection methods for in-context learning and highlights possible limits imposed by how models encode features.

Core claim

MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models tested, showing that such activation-based sampling should not be used for in-context learning. The authors hypothesise that superposition may explain the absence of signal and suggest sparse autoencoders as a possible next step.
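The quantity behind the 0.33 figure is Spearman's rank correlation. The sketch below shows one way to compute it with the Python standard library, as the Pearson correlation of tie-averaged ranks. The function names and the input values are illustrative assumptions, not the authors' code or data.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical inputs (invented for illustration): one activation statistic per
# candidate example, and the task accuracy observed with that example in the prompt.
activation_stat = [0.12, 0.45, 0.33, 0.90, 0.51]
task_accuracy = [0.61, 0.58, 0.64, 0.60, 0.63]
rho = spearman(activation_stat, task_accuracy)
print(f"rho = {rho:+.2f}, |rho| = {abs(rho):.2f}")
```

Because the measure works on ranks, it detects only monotonic association: a U-shaped relation such as `spearman([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9])` yields zero even though the two sequences are clearly related.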

What carries the argument

Massive activations and the first four moments computed from MLP layer outputs, used as scores for ranking candidate in-context examples during active learning.
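As a rough illustration of these two lenses, the sketch below computes the first four moments of a single activation vector and flags unusually large entries. The estimator choices (population moments, non-excess kurtosis) and both massive-activation thresholds are assumptions made for this example; the paper's own extraction rules and hyperparameters are not given here.

```python
from math import sqrt
from statistics import median


def four_moments(activations):
    """Mean, variance, skewness and kurtosis of one activation vector.

    Assumes population (divide-by-n) estimators and non-excess kurtosis,
    so a Gaussian sample would score about 3 on the last value.
    """
    n = len(activations)
    mean = sum(activations) / n
    centred = [a - mean for a in activations]
    variance = sum(c ** 2 for c in centred) / n
    if variance == 0:
        raise ValueError("skewness and kurtosis are undefined for a constant vector")
    std = sqrt(variance)
    skewness = sum(c ** 3 for c in centred) / (n * std ** 3)
    kurtosis = sum(c ** 4 for c in centred) / (n * variance ** 2)
    return mean, variance, skewness, kurtosis


def massive_activation_indices(activations, min_magnitude=100.0, ratio=1000.0):
    """Indices whose magnitude is large in absolute terms and also far above
    the vector's median magnitude. Both thresholds are placeholder assumptions."""
    magnitudes = [abs(a) for a in activations]
    med = median(magnitudes)
    return [
        i for i, m in enumerate(magnitudes)
        if m >= min_magnitude and m >= ratio * med
    ]


# Hypothetical MLP output for a single token (values invented for illustration).
mlp_output = [0.1, -0.2, 0.05, 250.0, -0.1, 0.15, -0.05]
print(four_moments(mlp_output))
print(massive_activation_indices(mlp_output))
```

Each statistic collapses the whole vector into one scalar per example, which is what makes it usable as a ranking score and also why any signal spread across several directions could be averaged away.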

If this is right

Activation-based sampling that relies on massive activations or statistical moments should not be applied to in-context example selection.
The lack of correlation persists across classification and generative tasks and across different attention masking strategies.
The same negative outcome appears for both the Llama-3.2-3B and Qwen2.5-3B base models.
Methods that disentangle features, such as sparse autoencoders, become a logical direction for future activation-based active learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If a different decomposition of the same MLP activations were used, a stronger correlation with example quality might appear.
Active learning for in-context learning may need to rely on signals other than raw activation statistics, such as uncertainty or diversity measures.
The finding raises the possibility that superposition is a general obstacle to activation-based selection across current transformer designs.

Load-bearing premise

The chosen statistics of massive activations and first four moments are sufficient to detect any useful signal about example quality that exists in the MLP activations.

What would settle it

Finding a Spearman correlation above 0.5 between these same activation statistics and task performance on a held-out model or dataset would falsify the central negative result.

Figures

Figures reproduced from arXiv: 2606.05134 by Geoff V. Merrett, Stuart E. Middleton, Yaseen M. Osman.

**Figure 2.** Accuracy of Llama-3.2-3B on ARC-C when sampling examples by argmin, argmedian, and argmax.

**Figure 3.** Spearman correlation and p-values between task performance and different activation-based metrics for GSM8K using Llama-3.2-3B.
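The argmin, argmedian, and argmax strategies named in the Figure 2 caption can be read as picking the candidates with the lowest, middle, or highest activation score. The sketch below is one plausible reading, not the paper's procedure; the function name, the centred-window definition of argmedian, and the scores are assumptions.

```python
def select_examples(scores, k, strategy):
    """Return indices of the k candidates with the lowest, middle or highest scores."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of candidates")
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending by score
    if strategy == "argmin":
        return order[:k]
    if strategy == "argmax":
        return order[-k:][::-1]  # highest score first
    if strategy == "argmedian":
        start = (len(order) - k) // 2  # window centred on the median rank
        return order[start:start + k]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical activation scores for five candidate examples.
scores = [0.7, 0.1, 0.9, 0.4, 0.5]
for strategy in ("argmin", "argmedian", "argmax"):
    print(strategy, select_examples(scores, 2, strategy))
```

If the score carried a usable signal, accuracy would be expected to differ systematically between the three strategies; similar accuracy across them is consistent with the weak correlations reported.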

Original abstract

Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reports a clear negative on two specific activation statistics for ICL example selection but overreaches when it generalizes to all activation-based sampling.

The core finding is that massive activations and the first four moments of MLP outputs show at most 0.33 absolute Spearman correlation with example quality or downstream performance on the tested models and tasks. That negative result looks solid for the methods they actually ran.

What the paper does well is run a reasonably broad set of experiments: two 3B-scale base models, multiple classification and generative datasets, and several attention-masking regimes. Reporting the negative outcome directly, rather than hunting for a positive spin, is useful. It also correctly flags superposition as a plausible reason the simple statistics failed and points to SAEs as a logical next step.

The soft spot is the jump from “these two views of the activations do not work” to “such activation-based sampling should not be used.” The stress-test note is right: if the useful signal lives in a different linear combination or sparse direction, the chosen lenses would miss it. The paper itself raises superposition, so the broader recommendation sits on an assumption that is not fully tested. Without the full methods section I cannot judge dataset sizes, exact masking code, or whether the 0.33 bound holds after multiple-comparison correction, but the abstract alone already shows the claim is scoped more narrowly than the final sentence.
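To make the multiple-comparison point concrete: when many (model, dataset, statistic) pairs are each tested for a non-zero correlation, the p-values need adjusting before any single one is read as significant. The sketch below uses the Holm step-down procedure, which is a common choice swapped in here for illustration; nothing in the source says which correction, if any, the paper applies, and the p-values are invented.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per test: True where the null hypothesis is rejected
    under Holm's step-down procedure at family-wise error rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    rejected = [False] * m
    for step, idx in enumerate(order):
        # Smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            break  # once one test fails, every larger p-value fails as well
    return rejected


# Hypothetical p-values from four separate correlation tests.
p_values = [0.001, 0.04, 0.03, 0.2]
print(holm_bonferroni(p_values))
```

In this made-up case only the first test survives, although three of the four raw p-values fall below 0.05.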

This is the kind of targeted negative result that belongs in the ICL and active-learning literature. Readers working on prompting or example selection will want to see the details and decide whether the tested statistics were the right ones to try first. It is worth sending to peer review so the experimental choices and the precise scope of the conclusion can be checked.

Referee Report

1 major / 2 minor

Summary. The paper tests whether MLP activations can serve as a fine-grained signal for active learning in LLM in-context example selection. Across Llama-3.2-3B and Qwen2.5-3B on diverse classification and generative tasks, with varying attention masking, the authors compute Spearman correlations between two views of MLP outputs (massive activations and the first four moments) and measures of example quality or downstream performance, reporting a maximum absolute correlation of 0.33. They conclude that activation-based sampling should not be used and hypothesize that superposition may explain the lack of signal, proposing SAEs as a future direction.

Significance. If the reported correlations are robust, the negative empirical result is a useful contribution to deep active learning for in-context learning: it supplies concrete evidence that simple activation statistics are unlikely to identify high-quality examples and supplies an explicit hypothesis (superposition) together with a concrete next step (SAEs). The breadth of the experimental design—two base models, multiple task types, and explicit attention-masking variants—is a clear strength that increases the reliability of the null finding.

major comments (1)

[Abstract] The statement that 'such activation-based sampling should not be used for in-context learning' is load-bearing for the paper's central recommendation, yet it is derived only from the two tested lenses (massive activations and first four moments). The paper's own superposition hypothesis implies that other representations of the same MLP activations could still carry usable signal; therefore the recommendation should be scoped explicitly to the statistics examined rather than to activation-based methods in general.

minor comments (2)

[Methods] Supply the precise algorithmic definitions and any hyperparameters (thresholds, normalization, masking implementation) used to extract massive activations and the four moments so that the correlation computations can be reproduced.
[Results] Report p-values or confidence intervals alongside the Spearman coefficients to allow readers to judge whether the observed |ρ| ≤ 0.33 values are statistically distinguishable from zero or from a practically meaningful threshold.
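One way to attach a p-value to a Spearman coefficient without distributional assumptions is a permutation test, sketched below. It uses the shortcut formula ρ = 1 − 6Σd² / (n(n² − 1)), which is valid only when neither sequence contains ties. The function names, the number of permutations, and the seed are assumptions for this example, not details from the paper.

```python
import random


def spearman_no_ties(xs, ys):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    if len(set(xs)) != n or len(set(ys)) != n:
        raise ValueError("shortcut formula assumes no tied values")
    rank_x = {v: r for r, v in enumerate(sorted(xs), start=1)}
    rank_y = {v: r for r, v in enumerate(sorted(ys), start=1)}
    d_squared = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def permutation_p_value(xs, ys, n_permutations=999, seed=0):
    """Two-sided p-value: share of random pairings whose |rho| is at least
    as large as the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman_no_ties(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        if abs(spearman_no_ties(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed pairing itself, so p is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical scores and accuracies (invented for illustration).
activation_stat = [0.12, 0.45, 0.33, 0.90, 0.51]
task_accuracy = [0.61, 0.58, 0.64, 0.60, 0.63]
print(spearman_no_ties(activation_stat, task_accuracy))
print(permutation_p_value(activation_stat, task_accuracy))
```

With only a handful of examples even a moderate coefficient is easily produced by chance, which is why the sample size behind each reported correlation matters as much as the coefficient itself.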

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract's recommendation should be scoped more precisely to the activation statistics tested, and we will make this revision.

Referee: [Abstract] The statement that 'such activation-based sampling should not be used for in-context learning' is load-bearing for the paper's central recommendation, yet it is derived only from the two tested lenses (massive activations and first four moments). The paper's own superposition hypothesis implies that other representations of the same MLP activations could still carry usable signal; therefore the recommendation should be scoped explicitly to the statistics examined rather than to activation-based methods in general.

Authors: We agree with this observation. Our negative result applies specifically to the two views of MLP activations examined (massive activations and the first four moments), and the superposition hypothesis we advance does leave open the possibility that alternative representations of the same activations could yield usable signal. We will revise the abstract to state that the tested activation statistics should not be used for in-context learning, rather than activation-based sampling in general. The revised wording will be: 'showing that the activation statistics examined in this work should not be used for in-context learning.' This change will also be reflected in the conclusion section for consistency.

Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical correlations reported without reduction to fitted inputs or self-citations

The paper's central claim rests on computing Spearman rank correlations between two specific views of MLP activations (massive activations and first four moments) and task performance/example quality across models and datasets. These are straightforward statistical measurements on held-out data with no equations, fitted parameters, or self-citations that reduce the reported |ρ| ≤ 0.33 result to the inputs by construction. The superposition hypothesis is explicitly labeled as speculation for future work (SAEs) rather than a load-bearing premise. No self-definitional, fitted-input, or uniqueness patterns appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that Spearman correlation between activation statistics and task performance is an appropriate test of whether activations provide a usable selection signal, plus the modeling assumption that the two 3B base models and chosen task suite are representative enough to support the general recommendation against activation-based sampling.

axioms (1)

Domain assumption: Spearman rank correlation is a valid measure of association between activation-derived scalars and downstream task performance.
Invoked when the paper reports absolute Spearman coefficients as the decisive evidence that activations do not correlate with example quality.

pith-pipeline@v0.9.1-grok · 5736 in / 1405 out tokens · 44895 ms · 2026-06-28T06:25:00.607987+00:00 · methodology

[1] [1]

2024 , url=

Parishad BehnamGhader and Vaibhav Adlakha and Marius Mosbach and Dzmitry Bahdanau and Nicolas Chapados and Siva Reddy , booktitle=. 2024 , url=

2024

[2] [2]

Language Models are Few-Shot Learners , url =

Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

[3] [3]

Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers , url =

Chen, Shijie and Jimenez Gutierrez, Bernal and Su, Yu , booktitle =. Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers , url =

[4] [4]

First Conference on Language Modeling , year=

Massive Activations in Large Language Models , author=. First Conference on Language Modeling , year=

[5] [5]

The Twelfth International Conference on Learning Representations , year=

Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. The Twelfth International Conference on Learning Representations , year=

[6] [6]

2022 , journal=

Toy Models of Superposition , author=. 2022 , journal=

2022

[7] [7]

doi:10.5281/zenodo.10256836 , url =

Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

work page doi:10.5281/zenodo.10256836

[8] [8]

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , url =

Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel , booktitle =. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , url =

[9] [9]

arXiv preprint arXiv:1803.05457 , year=

Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=

Pith/arXiv arXiv

[10] [10]

arXiv preprint arXiv:2110.14168 , year=

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv

[11] [11]

The Thirteenth International Conference on Learning Representations , year=

Repetition Improves Language Model Embeddings , author=. The Thirteenth International Conference on Learning Representations , year=

[12] [12]

arXiv preprint arXiv:2512.14982 , year=

Prompt Repetition Improves Non-Reasoning LLMs , author=. arXiv preprint arXiv:2512.14982 , year=

arXiv

[13] [13]

arXiv preprint arXiv:2505.20195 , year=

Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning , author=. arXiv preprint arXiv:2505.20195 , year=

arXiv

[14] [14]

Man Luo and Xin Xu and Zhuyun Dai and Panupong Pasupat and Mehran Kazemi and Chitta Baral and Vaiva Imbrasaite and Vincent Y Zhao , booktitle=. Dr. 2023 , url=

2023

[15] [15]

arXiv preprint arXiv:2310.20046 , year=

Which examples to annotate for in-context learning? towards effective and efficient selection , author=. arXiv preprint arXiv:2310.20046 , year=

arXiv

[16] [16]

arXiv preprint arXiv:2603.27385 , year=

Active In-Context Learning for Tabular Foundation Models , author=. arXiv preprint arXiv:2603.27385 , year=

arXiv

[17] [17]

Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025 , year=

Contrastive In-Context Learning with Active Memory for Task Planning , author=. Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025 , year=

2025

[18] Malik, Vijit and Pande, Atul and Majumder, Anirban. Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 2025. doi:10.1145/3746252.3761536

[19] Mingyi Liu and Zhiying Tu and Tong Zhang and Tonghua Su and Xiaofei Xu and Zhongjie Wang. LTP: A New Active Learning Strategy for CRF-Based Named Entity Recognition. Neural Processing Letters.

[20] Zhang, Leihan and Zhang, Le. Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence. 2020. doi:10.1145/3374587.3374611

[21] Shelmanov, Artem and Liventsev, Vadim and Kireev, Danil and Khromov, Nikita and Panchenko, Alexander and Fedulova, Irina and Dylov, Dmitry V. Active Learning with Deep Pre-trained Models for Sequence Tagging of Clinical and Biomedical Texts. 2019. doi:10.1109/BIBM47256.2019.8983157

[22] Burr Settles. Active Learning Literature Survey.

[23] Tong, Simon. 2001.

[24] Cohn, David and Ghahramani, Zoubin and Jordan, Michael. Active Learning with Statistical Models.

[25] Compositional Exemplars for In-context Learning. Proceedings of the 40th International Conference on Machine Learning. 2023.

[26] Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer. arXiv preprint arXiv:2506.01115.

[27] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. 2023.

[28] Fact finding: Attempting to reverse-engineer factual recall on the neuron level. Alignment Forum.

[29] Virtanen, Pauli and Gommers, Ralf and Oliphant, Travis E. and Haberland, Matt and Reddy, Tyler and Cournapeau, David and Burovski, Evgeni and Peterson, Pearu and Weckesser, Warren and Bright, Jonathan, et al. Nature Methods.

[30] Qwen Team. Qwen2.5: A Party of Foundation Models.

[31] Qwen2 Technical Report. arXiv preprint arXiv:2407.10671.

[32] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

[33] -AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation. arXiv preprint arXiv:2508.09199.

[34] Wu, Xinyi and Ajorlou, Amir and Wang, Yifei and Jegelka, Stefanie and Jadbabaie, Ali. On the Role of Attention Masks and LayerNorm in Transformers. doi:10.52202/079017-0472

[35] Swietojanski, Pawel and Braun, Stefan and Can, Dogan and Da Silva, Thiago Fraga and Ghoshal, Arnab and Hori, Takaaki and Hsiao, Roger and Mason, Henry and McDermott, Erik and Silovsky, Honza and Travadi, Ruchir and Zhuang, Xiaodan. Variable Attention Masking for Configurable Transformer Transducer Speech Recognition. 2023. doi:10.1109/icassp49357.2023.10094588

[36] Barbero, Federico and Banino, Andrea and Kapturowski, Steven and Kumaran, Dharshan and Ara\'. Transformers need glasses! Information over-squashing in language tasks. Advances in Neural Information Processing Systems.

[37] Federico Barbero and Alvaro Arroyo and Xiangming Gu and Christos Perivolaropoulos and Petar Veli. Why do. Second Conference on Language Modeling.

[38] Lin, Ziyong and Wu, Haoyi and Wang, Shu and Tu, Kewei and Zheng, Zilong and Jia, Zixia. Look Both Ways and No Sink: Converting LLMs into Text Encoders without Training. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1113

[39] Margatina, Katerina and Schick, Timo and Aletras, Nikolaos and Dwivedi-Yu, Jane. Active Learning Principles for In-Context Learning with Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.334

[40] Liu, Jiachang and Shen, Dinghan and Zhang, Yizhe and Dolan, Bill and Carin, Lawrence and Chen, Weizhu. What Makes Good In-Context Examples for GPT-3? Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. 2022. doi:10.18653/v1/2022.deelio-1.10

[41] Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ... 2019. doi:10.18653/v1/n19-1300

[42] Mihaylov, Todor and Clark, Peter and Khot, Tushar and Sabharwal, Ashish. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1260

[43] Schoch, Stephanie and Ji, Yangfeng. Monte Carlo Sampling for Analyzing In-Context Examples. The Sixth Workshop on Insights from Negative Results in NLP. 2025. doi:10.18653/v1/2025.insights-1.7

[44] Zhang, Yiming and Feng, Shi and Tan, Chenhao. Active Example Selection for In-Context Learning. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.622

[45] Li, Xiaonan and Qiu, Xipeng. Finding Support Examples for In-Context Learning. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.411

[46] Ein-Dor, Liat and Halfon, Alon and Gera, Ariel and Shnarch, Eyal and Dankin, Lena and Choshen, Leshem and Danilevsky, Marina and Aharonov, Ranit and Katz, Yoav and Slonim, Noam. Active Learning for BERT: An Empirical Study. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.638

[47] Bayer, Markus and Lutz, Justin and Reuter, Christian. ActiveLLM: Large Language Model-Based Active Learning for Textual Few-Shot Scenarios. Transactions of the Association for Computational Linguistics. 2026. doi:10.1162/tacl.a.63

[48] Rubin, Ohad and Herzig, Jonathan and Berant, Jonathan. Learning To Retrieve Prompts for In-Context Learning. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.191

[49] Wu, Zhiyong and Wang, Yaoxiang and Ye, Jiacheng and Kong, Lingpeng. Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.79

[50] Yao, Bingsheng and Chen, Guiming and Zou, Ruishi and Lu, Yuxuan and Li, Jiachen and Zhang, Shao and Sang, Yisi and Liu, Sijia and Hendler, James and Wang, Dakuo. More Samples or More Prompts? Exploring Effective Few-Shot In-Context Learning for LLMs with In-Context Sampling. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.115

[51] Qin, Chengwei and Zhang, Aston and Chen, Chen and Dagar, Anirudh and Ye, Wenming. In-Context Learning with Iterative Demonstration Selection. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.438