Self-Improving In-Context Learning

Baturay Saglam; Dionysis Kalogerias

arxiv: 2605.23180 · v1 · pith:BJSFZV23new · submitted 2026-05-22 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-25 04:53 UTC · model grok-4.3

keywords in-context learning · prompt optimization · test-time calibration · self-supervised proxy · zeroth-order optimization · few-shot prompting

The pith

Optimizing the continuous embeddings of a fixed few-shot prompt at test time improves in-context learning by maximizing a log-probability proxy on the demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the log-probabilities a model assigns to the outputs in its own few-shot demonstrations form a usable signal for how well the model has inferred the task. This signal is turned into a bounded self-supervised confidence proxy that can be maximized at test time through zeroth-order optimization of the prompt embeddings. The procedure requires no finetuning, no token generation, no external data, and works for both classification and free-form generation. Experiments across a suite of ICL tasks show the calibrated prompts match or beat the base model and outperform classification-specific baselines on most tasks, with a statistically significant correlation between proxy gains and accuracy gains.

Core claim

The central claim is that the log-probabilities assigned to demonstrated outputs, available from a single forward pass, constitute a reliable optimization signal for in-context learning; maximizing a formal bounded confidence proxy derived from them via zeroth-order search over prompt embeddings yields better task performance on unseen inputs from the same fixed demonstrations.
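To make the idea of a bounded confidence signal concrete, here is a minimal sketch of one way to map demonstration log-probabilities to a score in (0, 1]. It is an assumption for illustration only: the paper's actual proxy is a three-component weighted quantity whose definition is not reproduced on this page, and the function name `bounded_confidence` and the numbers below are hypothetical.

```python
import math

def bounded_confidence(logprobs_per_demo):
    """Map per-token log-probabilities of demonstrated outputs to a score in (0, 1].

    logprobs_per_demo: list of lists; each inner list holds the log-probabilities
    assigned to the output tokens of one demonstration (all values <= 0).
    NOTE: hypothetical stand-in, not the paper's proxy.
    """
    per_demo = []
    for token_logprobs in logprobs_per_demo:
        if not token_logprobs:
            raise ValueError("each demonstration needs at least one output token")
        # Length-normalised log-likelihood; exp() turns it into the
        # geometric-mean token probability, which lies in (0, 1].
        mean_lp = sum(token_logprobs) / len(token_logprobs)
        per_demo.append(math.exp(mean_lp))
    # Averaging values in (0, 1] keeps the result in (0, 1].
    return sum(per_demo) / len(per_demo)

# Made-up log-probabilities for three demonstrations.
demo_logprobs = [[-0.1, -0.2], [-1.0], [-0.05, -0.05, -0.3]]
score = bounded_confidence(demo_logprobs)
```

Because the score only needs the log-probabilities of tokens already present in the prompt, it can be read off a forward pass without generating anything, which is the property the core claim relies on.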

What carries the argument

A bounded self-supervised confidence proxy derived from the log-probabilities of demonstrated outputs, maximized over continuous prompt embeddings via zeroth-order optimization.
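The optimization step can be sketched with a generic two-point zeroth-order estimator that uses only forward evaluations of a black-box objective. Everything here is a toy under stated assumptions: the "model" is a set of random linear maps rather than a language model, the objective is the mean probability of each demonstration's target token rather than the paper's proxy, and the names `make_toy_proxy` and `zeroth_order_ascent`, the step size, and the best-iterate tracking are hypothetical. Only the default `mu=0.004` is borrowed from the perturbation scale quoted in the Figure 3 caption.

```python
import math
import random

def softmax_prob(logits, target):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[target] / sum(exps)

def make_toy_proxy(weights, targets):
    """Return f(e): mean target-token probability over demonstrations (toy model)."""
    def proxy(e):
        total = 0.0
        for W, t in zip(weights, targets):
            logits = [sum(w * x for w, x in zip(row, e)) for row in W]
            total += softmax_prob(logits, t)
        return total / len(targets)
    return proxy

def zeroth_order_ascent(f, e, steps=300, n_samples=8, mu=0.004, lr=0.1, seed=0):
    """Maximise f using only function evaluations (no backpropagation)."""
    rng = random.Random(seed)
    e = list(e)
    best_e, best_val = list(e), f(e)
    for _ in range(steps):
        grad = [0.0] * len(e)
        for _ in range(n_samples):
            u = [rng.gauss(0.0, 1.0) for _ in e]  # random search direction
            plus = f([x + mu * ui for x, ui in zip(e, u)])
            minus = f([x - mu * ui for x, ui in zip(e, u)])
            # Central finite difference along u, averaged over samples.
            coeff = (plus - minus) / (2.0 * mu * n_samples)
            grad = [g + coeff * ui for g, ui in zip(grad, u)]
        e = [x + lr * g for x, g in zip(e, grad)]
        val = f(e)
        if val > best_val:  # keep the best iterate seen so far
            best_e, best_val = list(e), val
    return best_e, best_val

rng = random.Random(1)
dim, vocab, n_demos = 6, 5, 3
weights = [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
           for _ in range(n_demos)]
targets = [0, 2, 4]
proxy = make_toy_proxy(weights, targets)
e0 = [0.0] * dim                      # stands in for the initial prompt embeddings
before = proxy(e0)
e_star, after = zeroth_order_ascent(proxy, e0)
```

Each step costs `2 * n_samples + 1` evaluations of the objective, so the per-query budget is set by the number of steps and perturbation samples rather than by any gradient computation.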

If this is right

The calibration procedure matches or improves base-model performance across a range of ICL tasks.
It outperforms classification-specific baselines on most evaluated tasks.
A statistically significant correlation exists between improvement in the proxy and gains in downstream accuracy.
The same procedure applies without modification to both classification and free-form generation tasks.
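The correlation claim above can be checked with a rank test; the Figure 1 caption reports a one-sided Spearman test (H1: ρ > 0). The sketch below computes Spearman's ρ in plain Python and estimates a one-sided p-value by permutation. The permutation approach is an assumption (how the paper obtains its p-value is not stated on this page), and the data points are made up for illustration.

```python
import random

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman_one_sided(x, y, n_perm=5000, seed=0):
    """Return (rho, p) for H1: rho > 0, with p estimated by permutation."""
    rx, ry = ranks(x), ranks(y)
    rho = pearson(rx, ry)  # Spearman = Pearson on ranks
    rng = random.Random(seed)
    shuffled = list(ry)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(rx, shuffled) >= rho:
            hits += 1
    return rho, (hits + 1) / (n_perm + 1)

# Made-up (proxy gain, accuracy gain) pairs, one per hypothetical task.
proxy_gain = [0.01, 0.03, 0.02, 0.08, 0.05, 0.10, 0.07, 0.04]
acc_gain = [0.00, 0.01, 0.02, 0.06, 0.03, 0.09, 0.04, 0.05]
rho, p_value = spearman_one_sided(proxy_gain, acc_gain)
```

A rank-based test is a reasonable fit here because it only asks whether larger proxy gains tend to accompany larger accuracy gains, without assuming a linear relationship between the two.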

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

A natural follow-up test is whether the same proxy can guide the selection or reweighting of the demonstrations included in the prompt.
It implies that prompt embeddings can be treated as continuous parameters for inference-time adaptation even when the underlying model weights stay frozen.
Real-time deployment on streaming inputs might become feasible if the zeroth-order steps can be limited to a small fixed budget per query.

Load-bearing premise

Maximizing the log-probability proxy computed on the fixed demonstrations will produce better predictions on unseen test inputs rather than merely fitting the demonstrations more closely.

What would settle it

If optimizing the proxy yields no increase, or a decrease, in accuracy on held-out test examples relative to the unoptimized base prompt, the claim that the proxy encodes a reliable downstream signal would be falsified.

Figures

Figures reproduced from arXiv: 2605.23180 by Baturay Saglam, Dionysis Kalogerias.

**Figure 1.** Proxy improvement versus accuracy improvement across all 12 ICLEval tasks. A one-sided Spearman rank correlation test (H1: ρ > 0) yields a statistically significant positive association across all models and tasks combined. Task labels shown in the plot: Format Check, Order Adj., De-duplication, Count & Nav., Rel. Analysis, Format Conv., Str. Completion, List Map., Format Cloning, Dup. Check, Dict. Search, Order Check.

**Figure 2.** Ablation studies of the proposed three-component proxy (α = 0.6, β = 0.3, γ = 0.1) and the perturbation domain used in end-to-end calibration. (a) Each proxy component is ablated by setting its coefficient to zero and renormalizing the remaining weights. (b) Accuracy difference between the default perturbation domain (demonstration embeddings only) and full-sequence perturbation that includes query positions. Re…

**Figure 3.** Downstream accuracy under varying perturbation sample counts N and perturbation scales, expressed as fractions of the optimal value μ = 0.004, with all other hyperparameters held fixed. Results are reported for Qwen3-4B, where (∗) indicates the values used in the main experiments. The confidence component (C̄) is clearly the primary driver of the method’s gains: removing it reduces the fraction of sample…
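The Figure 2 caption describes ablating a proxy component by setting its coefficient to zero and renormalizing the remaining weights. A minimal sketch of that arithmetic follows, assuming "renormalizing" means rescaling the surviving coefficients proportionally so they sum to 1. The helper name `ablate` and the dictionary keys are hypothetical, and which coefficient belongs to which proxy component is not stated on this page.

```python
def ablate(weights, name):
    """Zero one coefficient and rescale the others so they sum to 1 again."""
    kept = {k: v for k, v in weights.items() if k != name}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("remaining weights must have positive mass")
    out = {k: v / total for k, v in kept.items()}
    out[name] = 0.0
    return out

# Default coefficients as quoted in the Figure 2 caption.
weights = {"alpha": 0.6, "beta": 0.3, "gamma": 0.1}

no_alpha = ablate(weights, "alpha")  # beta ~ 0.75, gamma ~ 0.25
no_beta = ablate(weights, "beta")    # alpha ~ 6/7, gamma ~ 1/7
no_gamma = ablate(weights, "gamma")  # alpha ~ 2/3, beta ~ 1/3
```

Under this reading, removing the largest coefficient shifts most of the weight onto the second-largest one, so an ablation changes the relative emphasis of the surviving components as well as dropping one of them.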

Abstract

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs, available from a single forward pass without generating any tokens, provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a test-time embedding tweak for ICL driven by a log-prob proxy on the demos, but the evidence that this actually helps on unseen inputs rather than just the fixed examples is still thin.

The new piece is formalizing the model's own log-probabilities on the demonstrated outputs as a bounded self-supervised proxy, then using zeroth-order optimization to adjust the continuous prompt embeddings at test time. Each evaluation of the proxy needs only a single forward pass, with no generation, no labels, and no external data, so the method applies to both classification and free-form generation. That setup is cleaner than many prior calibration tricks that are limited to closed label sets.

Referee Report

2 major / 1 minor

Summary. The paper proposes optimizing the continuous embeddings of a fixed few-shot prompt at test time to maximize a bounded self-supervised confidence proxy based on the model's log-probabilities assigned to the demonstrated outputs (available from a single forward pass). This zeroth-order optimization yields a calibration procedure requiring no finetuning, token generation, predefined label sets, or external data, applicable to both classification and free-form generation. Across ICL tasks the method is reported to match or improve base-model performance, outperform classification-specific baselines on most tasks, and exhibit a statistically significant correlation between proxy improvement and downstream accuracy gains.

Significance. If the central claim holds, the work supplies a lightweight, general test-time adaptation technique for ICL that relies solely on the model's internal signals and extends to generation tasks. The reported correlation supplies empirical grounding for the proxy as an optimization signal. No machine-checked proofs or parameter-free derivations are claimed, but the absence of external data or generation steps is a practical strength.

major comments (2)

[Abstract] The claim that the statistically significant correlation 'confirms that the proposed proxy encodes a reliable optimization signal' does not address whether maximizing the demonstration log-prob proxy changes conditional behavior on unseen test inputs or merely increases probability mass on the fixed demonstration tokens; this distinction is load-bearing for the generalization claim.
[Method] (Zeroth-order optimization description.) Because the objective is defined exclusively on the fixed demonstrations and never observes test inputs, the manuscript must supply analysis or controls showing that embedding adjustments alter predictions outside the demonstration set rather than overfitting the demonstrated outputs; the current correlation evidence alone leaves this open.

minor comments (1)

The notation for the bounded proxy could be introduced with an explicit equation rather than a prose description, which would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the generalization properties of our test-time optimization procedure. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The claim that the statistically significant correlation 'confirms that the proposed proxy encodes a reliable optimization signal' does not address whether maximizing the demonstration log-prob proxy changes conditional behavior on unseen test inputs or merely increases probability mass on the fixed demonstration tokens; this distinction is load-bearing for the generalization claim.

Authors: We agree the abstract phrasing should be more precise on this point. All downstream accuracy results are measured on held-out test inputs never seen during optimization, and the reported correlation is specifically between proxy gains on the demonstrations and accuracy improvements on those unseen test examples. This already provides evidence that the embedding adjustments affect conditional behavior beyond the fixed demonstrations. We will revise the abstract to explicitly state that the correlation is with test-set accuracy gains, thereby underscoring the generalization aspect. Revision: yes.

Referee: [Method] (Zeroth-order optimization description.) Because the objective is defined exclusively on the fixed demonstrations and never observes test inputs, the manuscript must supply analysis or controls showing that embedding adjustments alter predictions outside the demonstration set rather than overfitting the demonstrated outputs; the current correlation evidence alone leaves this open.

Authors: We concur that dedicated controls would strengthen the presentation. While the statistically significant correlation with test accuracy (measured on inputs outside the optimization set) already indicates that the adjustments influence predictions on unseen data, we will add a new analysis subsection. This will include quantitative comparisons of model output distributions on test inputs before versus after optimization, along with discussion of the bounded, small-magnitude nature of the zeroth-order updates to address potential overfitting concerns. Revision: yes.
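The before-versus-after comparison of output distributions promised in the rebuttal could take many forms. As a hedged illustration, the sketch below computes two standard divergences between a pair of next-token distributions on a single test input. The choice of total variation and KL divergence is an assumption, not something the rebuttal specifies, and the probabilities are made up.

```python
import math

def total_variation(p, q):
    """Half the L1 distance between two discrete distributions; lies in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero entries in q."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

# Made-up output distributions over three candidate tokens for one test input.
before = [0.50, 0.30, 0.20]   # unoptimized prompt
after = [0.70, 0.20, 0.10]    # prompt after test-time optimization

tv = total_variation(before, after)
kl = kl_divergence(after, before)
```

Averaging such a quantity over held-out inputs would show whether the embedding adjustments move predictions on inputs the objective never saw, which is the distinction the referee asks the authors to establish.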

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained against external test accuracy

The central procedure optimizes prompt embeddings to maximize a log-probability proxy computed solely on the fixed demonstration outputs. This proxy is explicitly defined from the model's forward pass on those demonstrations, but the claimed benefit is measured on independent test inputs whose labels are never observed during optimization. The reported statistically significant correlation between proxy improvement and downstream accuracy gain constitutes an external empirical check rather than a definitional reduction. No equations equate the optimized proxy directly to test accuracy by construction, no self-citation chains bear the load, and no uniqueness theorems or ansatzes are smuggled in. The method is therefore not forced to succeed; any observed lift on unseen data stands as a genuine empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the log-prob signal is informative for task inference and on the modeling choice to treat prompt embeddings as continuous optimizable variables. The abstract states no explicit free parameters or new invented entities beyond the proxy itself, although the figure captions report fixed proxy weights (α = 0.6, β = 0.3, γ = 0.1) and a perturbation scale μ = 0.004.

axioms (1)

domain assumption: Log-probabilities assigned to demonstrated outputs provide a meaningful signal for how well the model has inferred the task.
Stated as the key observation enabling the proxy; central to the optimization signal.

invented entities (1)

bounded self-supervised confidence proxy (no independent evidence)
purpose: to serve as an optimizable signal derived from demonstration log-probabilities.
Formalized in the paper from the observed log-prob signal; no independent evidence outside the method itself.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 5 internal anchors

[1]

Rahul Atul Bhope, Praveen Venkateswaran, K. R. Jayaram, Vatche Isahagian, Vinod Muthusamy, and Nalini Venkatasubramanian. OptiSeq: Ordering examples on-the-fly for in-context learning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2486...

work page 2025
[2]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page 2020
[3]

ICLEval: Evaluating in-context learning ability of large language models

Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, YanTao Jia, Zhao Cao, and Ji-Rong Wen. ICLEval: Evaluating in-context learning ability of large language models. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Lingui...

work page 2025
[4]

Token- based decision criteria are suboptimal in in-context learning

Hakaze Cho, Yoshihiro Sakai, Mariko Kato, Kenshiro Tanaka, Akira Ishii, and Naoya Inoue. Token- based decision criteria are suboptimal in in-context learning. In Luis Chiruzzo, Alan Ritter, and 12 Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Tech...

work page doi:10.18653/v1/2025.naacl-long.278 2025
[5]

Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers. In Anna Rogers, Jor- dan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019, Toronto, Canada, July 202...

work page doi:10.18653/v1/2023.findings-acl.247 2023
[7]

Complexity-based prompting for multi-step reasoning, 2023

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning, 2023. URLhttps://arxiv.org/abs/2210.00720

work page arXiv 2023
[8]

Variance- reduced zeroth-order methods for fine-tuning language models

Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, and Wooseok Ha. Variance- reduced zeroth-order methods for fine-tuning language models. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=VHO4nE7v41

work page 2024
[9]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Textbooks Are All You Need

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, 2023. URLhttps://arxiv.o...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

What makes a good order of examples in in-context learning

Qi Guo, Leiyu Wang, Yidong Wang, Wei Ye, and Shikun Zhang. What makes a good order of examples in in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 14892–14904, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/...

work page doi:10.18653/v1/2024.findings-acl.884 2024
[12]

Prototypical calibration for few-shot learning of language models

Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, and Furu Wei. Prototypical calibration for few-shot learning of language models. InThe Eleventh International Conference on Learning Representations,

work page
[13]

URLhttps://openreview.net/forum?id=nUsP9lFADUF

work page
[14]

In-context learning creates task vectors

Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9318–9333, Singapore, December 2023. Association for Computational Linguis- tics. doi: 10.18653/v1/2023.findings-emnlp.624. URL https://aclan...

work page doi:10.18653/v1/2023.findings-emnlp.624 2023
[15]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[16]

Surface form competition: Why the highest probability answer isn’t always right

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, Online and...

work page doi:10.18653/v1/2021.emnlp-main.564 2021
[17]

Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator, 2022

Hyuhng Joon Kim, Hyunsoo Cho, Junyeob Kim, Taeuk Kim, Kang Min Yoo, and Sang goo Lee. Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator, 2022. URLhttps://arxiv.org/abs/2206.08082. 15

work page arXiv 2022
[18]

Answer-level calibration for free-form multiple choice question answering

Sawan Kumar. Answer-level calibration for free-form multiple choice question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 665–679, Dublin, Ireland, May 2022. Association for Computational Linguistics. do...

work page doi:10.18653/v1/2022.acl-long.49 2022
[19]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[20]

Diverse demonstrations improve in-context compositional generalization

Itay Levy, Ben Bogin, and Jonathan Berant. Diverse demonstrations improve in-context compositional generalization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1401– 1422, Toronto, Canada, July 2023. Association for Com...

work page doi:10.18653/v1/2023.acl- 2023
[21]

Finding support examples for in-context learning

Xiaonan Li and Xipeng Qiu. Finding support examples for in-context learning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6219–6235, Singapore, December 2023. Association for Computational Linguistics. doi: 10. 18653/v1/2023.findings-emnlp.411. URL https://aclanthology.o...

work page 2023
[22]

Task calibration: Calibrating large language models on inference tasks

Yingjie Li, Yun Luo, Xiaotian Xie, and Yue Zhang. Task calibration: Calibrating large language models on inference tasks. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 6937–6951, Vienna, Austria, July 2025. Association for Computational Lin...

work page doi:10.18653/v1/2025.findings-acl.362 2025
[23]

𝑠𝑒2: Sequential example selection for in-context learning

Haoyu Liu, Jianfeng Liu, Shaohan Huang, Yuefeng Zhan, Hao Sun, Weiwei Deng, Furu Wei, and Qi Zhang. 𝑠𝑒2: Sequential example selection for in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 5262–5284, Bangkok, Thailand, August 2024. Association for Comput...

work page doi:10.18653/v1/2024.findings-acl.312 2024
[24]

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Eneko Agirre, Marianna Apidianaki, and Ivan Vulić, editors, Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, D...

work page 2022
[25]

doi: 10.18653/v1/2022.deelio-1.10

Association for Computational Linguistics. doi: 10.18653/v1/2022.deelio-1.10. URL https: //aclanthology.org/2022.deelio-1.10/

work page doi:10.18653/v1/2022.deelio-1.10 2022
[26]

Sheng Liu, Haotian Ye, Lei Xing, and James Y. Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine L...

work page 2024
[27]

Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning

Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning. InThe Thirty-ninth Annual 16 Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum? id=Tjw0ACu3NL

work page 2025
[28]

Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

work page doi:10.18653/v1/2022.acl-long.556 2022
[29]

Z-ICL: Zero-shot in-context learning with pseudo-demonstrations

Xinxi Lyu, Sewon Min, Iz Beltagy, Luke Zettlemoyer, and Hannaneh Hajishirzi. Z-ICL: Zero-shot in-context learning with pseudo-demonstrations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2304–2317, Toronto, Canada, July...

work page doi:10.18653/v1/2023.acl-long.129 2023
[30]

Lee, Danqi Chen, and Sanjeev Arora

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=Vota6rFhBQ

work page 2023
[31]

Noisy channel language model prompting for few-shot text classification

Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5316–5330, Dublin, Ireland, Ma...

work page doi:10.18653/v1/2022.acl-long.365 2022
[32]

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhab...

work page doi:10.18653/v1/2022.emnlp-main.759 2022
[33]

Random gradient-free minimization of convex functions

Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017. doi: 10.1007/s10208-015-9296-2. URL https://doi.org/10.1007/s10208-015-9296-2

work page doi:10.1007/s10208-015-9296-2 2017
[34]

Revisiting demonstration selection strategies in in-context learning

Keqin Peng, Liang Ding, Yancheng Yuan, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. Revisiting demonstration selection strategies in in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9090–9101, Bangk...

work page
[35]

doi: 10.18653/v1/2024.acl-long.492

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.492. URL https: //aclanthology.org/2024.acl-long.492/

work page doi:10.18653/v1/2024.acl-long.492 2024
[36]

Rapid selection and ordering of in- context demonstrations via prompt embedding clustering

Kha Pham, Hung Le, Man Ngo, and Truyen Tran. Rapid selection and ordering of in- context demonstrations via prompt embedding clustering. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations, volume 2025, pages 43540–43556, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/ 6c2745a8e20...

work page 2025
[38]

Language models are unsupervised multitask learners.OpenAI Blog, 2019

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.OpenAI Blog, 2019

work page 2019
[39]

Test-time detoxification without training or learning anything, 2026

Baturay Saglam and Dionysis Kalogerias. Test-time detoxification without training or learning anything, 2026. URLhttps://arxiv.org/abs/2602.02498

work page arXiv 2026
[40]

Test-Time Safety Alignment

Baturay Saglam and Dionysis Kalogerias. Test-time safety alignment, 2026. URL https://arxiv. org/abs/2604.26167

work page internal anchor Pith review Pith/arXiv arXiv 2026
[41]

Learning task representations from in-context learning

Baturay Saglam, Xinyang Hu, Zhuoran Yang, Dionysis Kalogerias, and Amin Karbasi. Learning task representations from in-context learning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 6634–6663, Vienna, Austria, July 2025. Association for Co...

work page doi:10.18653/v1/2025.findings-acl.345 2025
[42]

Smith, and Tao Yu

Hongjin SU, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. Selective annotation makes language models better few-shot learners. InThe Eleventh International Conference on Learning Representations,

work page
[43]

URLhttps://openreview.net/forum?id=qY1hlv7gwg

work page
[44]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupati- raju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Li, Arnab Sen Sharma, Aaron Mueller, Byron C

Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=AwyxtyMwaG. arXiv:2310.15213

work page arXiv 2024
[46]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL htt...

work page 2017
[47]

Transformers learn in-context by gradient descent

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, Joao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machi...

work page 2023
[48]

Better zero-shot reasoning with self-adaptive prompting

Xingchen Wan, Ruoxi Sun, Hanjun Dai, Sercan Arik, and Tomas Pfister. Better zero-shot reasoning with self-adaptive prompting. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 3493–3514, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.1...

work page doi:10.18653/v1/2023.findings-acl.216 2023
[49]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Eisenschlos, Sercan Arik, and Tomas Pfister. Universal self-adaptive prompting. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7437– 7462, Singapore, December 2023. Association for Computational ...

[50]

Label words are anchors: An information flow perspective for understanding in-context learning

Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore, Dece...

[51]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-...

[52]

Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering

Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

[53]

$k$NN prompting: Beyond-context learning with calibration-free nearest neighbor inference

Benfeng Xu, Quan Wang, Zhendong Mao, Yajuan Lyu, Qiaoqiao She, and Yongdong Zhang. $k$NN prompting: Beyond-context learning with calibration-free nearest neighbor inference. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=fe2S7736sNS

[54]

Misconfidence-based demonstration selection for LLM in-context learning

Shangqing Xu and Chao Zhang. Misconfidence-based demonstration selection for LLM in-context learning, 2024. URL https://arxiv.org/abs/2401.06301

[55]

In-context example ordering guided by label distributions

Zhichao Xu, Daniel Cohen, Bei Wang, and Vivek Srikumar. In-context example ordering guided by label distributions. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2623–2640, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.167. URL https://aclanthology.org/2024.findings-naacl.167/

[57]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[58]

Representative demonstration selection for in-context learning with two-stage determinantal point process

Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Cao Liu, Jun Zhao, and Kang Liu. Representative demonstration selection for in-context learning with two-stage determinantal point process. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5443–5456, Singapore, Decembe...

[59]

Ground-truth labels matter: A deeper look into input-label demonstrations

Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, and Taeuk Kim. Ground-truth labels matter: A deeper look into input-label demonstrations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2422–2437, Abu D...

[60]

Unlocking black-box prompt tuning efficiency via zeroth-order optimization

Heshen Zhan, Congliang Chen, Tian Ding, Ziniu Li, and Ruoyu Sun. Unlocking black-box prompt tuning efficiency via zeroth-order optimization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14825–14838, Miami, Florida, USA, November 2024. Association for Computation...

[61]

Batch-ICL: Effective, efficient, and order-agnostic in-context learning

Kaiyi Zhang, Ang Lv, Yuhan Chen, Hansen Ha, Tao Xu, and Rui Yan. Batch-ICL: Effective, efficient, and order-agnostic in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 10728–10739, Bangkok, Thailand, August 2024. Association for Computational Linguistic...

[62]

DPZero: Private fine-tuning of language models without backpropagation

Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil, Sewoong Oh, and Niao He. DPZero: Private fine-tuning of language models without backpropagation. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

[63]

D.Va: Validate your demonstration first before you use it

Qi Zhang, Zhiqing Xiao, Ruixuan Xiao, Lirong Gao, and Junbo Zhao. D.Va: Validate your demonstration first before you use it. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2580–2594, Vienna, Austri...

[64]

COME: Test-time adaption by conservatively minimizing entropy

Qingyang Zhang, Yatao Bian, Xinke Kong, Peilin Zhao, and Changqing Zhang. COME: Test-time adaption by conservatively minimizing entropy. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=506BjJ1ziZ

[65]

Calibrate before use: Improving few-shot performance of language models

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR, 18–24 Jul 2021. URL https://p...

[66]

Large language models are not robust multiple choice selectors

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=shr9PXz7T0

[68]

Batch calibration: Rethinking calibration for in-context learning and prompt engineering

Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine A Heller, and Subhrajit Roy. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=L3FHMoKZcS.

A Applicability of Existing Test-Time Met...

