
arXiv: 2605.23180 · v1 · pith:BJSFZV23new · submitted 2026-05-22 · 💻 cs.CL · cs.LG

Self-Improving In-Context Learning

Pith reviewed 2026-05-25 04:53 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords in-context learning · prompt optimization · test-time calibration · self-supervised proxy · zeroth-order optimization · few-shot prompting

The pith

Optimizing the continuous embeddings of a fixed few-shot prompt at test time improves in-context learning by maximizing a log-probability proxy on the demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the log-probabilities a model assigns to the outputs in its own few-shot demonstrations form a usable signal for how well the model has inferred the task. This signal is turned into a bounded self-supervised confidence proxy that can be maximized at test time through zeroth-order optimization of the prompt embeddings. The procedure requires no finetuning, no token generation, no external data, and works for both classification and free-form generation. Experiments across a suite of ICL tasks show the calibrated prompts match or beat the base model and outperform classification-specific baselines on most tasks, with a statistically significant correlation between proxy gains and accuracy gains.

Core claim

The central claim is that the log-probabilities assigned to demonstrated outputs, available from a single forward pass, constitute a reliable optimization signal for in-context learning; maximizing a formal bounded confidence proxy derived from them via zeroth-order search over prompt embeddings yields better task performance on unseen inputs from the same fixed demonstrations.
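
As a rough illustration of how such a signal can be read off a single forward pass, the sketch below gathers the log-probabilities of the demonstrated output tokens from next-token logits and maps their mean to a value in (0, 1]. The function names, the masking convention, and the geometric-mean mapping are assumptions made for illustration. The paper's own proxy has three weighted components and is not reproduced here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def demo_confidence(logits, token_ids, output_mask):
    """Geometric-mean probability of the demonstrated output tokens.

    logits:      (T, V) next-token logits; row t scores token t + 1.
    token_ids:   (T + 1,) token ids of the few-shot prompt.
    output_mask: (T + 1,) bool, True where a token belongs to a
                 demonstrated output (hypothetical convention).
    """
    logp = log_softmax(logits)                    # (T, V)
    targets = token_ids[1:]                       # token predicted by each row
    token_logp = logp[np.arange(len(targets)), targets]
    mean_logp = token_logp[output_mask[1:]].mean()
    return float(np.exp(mean_logp))               # bounded in (0, 1]
```

No tokens are generated: the score depends only on logits the model already produces when it reads the prompt.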

What carries the argument

A bounded self-supervised confidence proxy derived from the log-probabilities of demonstrated outputs, maximized over continuous prompt embeddings via zeroth-order optimization.
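
A minimal sketch of zeroth-order ascent, assuming a two-point Gaussian-smoothing gradient estimator, may help make the idea concrete. The toy objective stands in for the bounded proxy, and the vector `x` stands in for the prompt embeddings. The perturbation scale 0.004 and the sample count N are named in the Figure 3 caption, but the estimator form, step size, step count, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def zo_gradient(f, x, mu, n_samples, rng):
    """Estimate the gradient of f at x from function values only."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)          # random search direction
        # Central difference along u, projected back onto u.
        grad += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return grad / n_samples

def zo_maximize(f, x0, mu=0.004, n_samples=16, lr=0.05, steps=200, seed=0):
    """Gradient-free ascent: only forward evaluations of f are needed."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        x = x + lr * zo_gradient(f, x, mu, n_samples, rng)
    return x

# Toy bounded objective in (0, 1]; a stand-in for the confidence proxy.
target = np.array([0.5, -0.25, 0.1])

def toy_proxy(e):
    return float(np.exp(-np.sum((e - target) ** 2)))

x0 = np.zeros(3)
x_opt = zo_maximize(toy_proxy, x0)
```

In a real setting each call to `f` would be one forward pass of the frozen model with perturbed embeddings, so the cost per step scales with the number of perturbation samples.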

If this is right

  • The calibration procedure matches or improves base-model performance across a range of ICL tasks.
  • It outperforms classification-specific baselines on most evaluated tasks.
  • A statistically significant correlation exists between improvement in the proxy and gains in downstream accuracy.
  • The same procedure applies without modification to both classification and free-form generation tasks.
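
The correlation claim in the list above can be checked with a one-sided rank test. The sketch below computes Spearman's ρ with average ranks for ties and estimates a one-sided p-value (alternative ρ > 0) by permutation. The permutation approach and the numbers in `proxy_gain` and `acc_gain` are made up for illustration and are not the paper's data or necessarily its exact test procedure.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def one_sided_permutation_pvalue(x, y, n_perm=2000, seed=0):
    """P-value for the alternative rho > 0, by shuffling the y ranks."""
    rng = np.random.default_rng(seed)
    rx, ry = average_ranks(x), average_ranks(y)
    observed = float(np.corrcoef(rx, ry)[0, 1])
    hits = sum(
        np.corrcoef(rx, rng.permutation(ry))[0, 1] >= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical per-task gains, for illustration only.
proxy_gain = [0.01, 0.03, 0.02, 0.08, 0.05, 0.10, 0.04, 0.07]
acc_gain = [0.00, 0.02, 0.01, 0.06, 0.05, 0.09, 0.01, 0.04]
rho, p_value = one_sided_permutation_pvalue(proxy_gain, acc_gain)
```

A small p-value here only supports a monotone association between the two gains; it does not by itself show that raising the proxy causes the accuracy gain.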

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on whether the same proxy can guide selection or reweighting of which demonstrations to include in the prompt.
  • It implies that prompt embeddings can be treated as continuous parameters for inference-time adaptation even when the underlying model weights stay frozen.
  • Real-time deployment on streaming inputs might become feasible if the zeroth-order steps can be limited to a small fixed budget per query.

Load-bearing premise

Maximizing the log-probability proxy computed on the fixed demonstrations will produce better predictions on unseen test inputs rather than merely fitting the demonstrations more closely.

What would settle it

If optimizing the proxy yields no increase, or a decrease, in accuracy on held-out test examples relative to the unoptimized base prompt, the claim that the proxy encodes a reliable downstream signal would be falsified.

Figures

Figures reproduced from arXiv: 2605.23180 by Baturay Saglam, Dionysis Kalogerias.

Figure 1: Proxy improvement versus accuracy improvement across all 12 ICLEval tasks. A one-sided Spearman rank correlation test (𝐻1 ∶ 𝜌 > 0) yields a statistically significant positive association across all models and tasks combined. Legend markers: (C), (R), (G). Task labels in the plot: Format Check, Order Adj., De-duplication, Count & Nav., Rel. Analysis, Format Conv., Str. Completion, List Map., Format Cloning, Dup. Check, Dict. Search, Order Check.
Figure 2: Ablation studies of the proposed three-component proxy (𝛼=0.6, 𝛽=0.3, 𝛾=0.1) and the perturbation domain used in end-to-end calibration. (a) Each proxy component is ablated by setting its coefficient to zero and renormalizing the remaining weights. (b) Accuracy difference between the default perturbation domain (demonstration embeddings only) and full-sequence perturbation that includes query positions. Re…
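
The ablation rule in the Figure 2 caption (zero one coefficient, renormalize the rest) works out as follows for the stated weights. This is a small worked sketch. The dictionary keys and the function name are hypothetical, and which proxy component each coefficient multiplies is not stated in this excerpt.

```python
def ablate_and_renormalize(weights, dropped):
    """Zero one coefficient and rescale the others to sum to 1."""
    kept = {name: w for name, w in weights.items() if name != dropped}
    total = sum(kept.values())
    result = {name: w / total for name, w in kept.items()}
    result[dropped] = 0.0
    return result

weights = {"alpha": 0.6, "beta": 0.3, "gamma": 0.1}

no_alpha = ablate_and_renormalize(weights, "alpha")  # beta 0.3/0.4, gamma 0.1/0.4
no_beta = ablate_and_renormalize(weights, "beta")    # alpha 0.6/0.7, gamma 0.1/0.7
no_gamma = ablate_and_renormalize(weights, "gamma")  # alpha 0.6/0.9, beta 0.3/0.9
```

Renormalizing keeps the ablated proxy on the same scale as the full one, so differences in downstream accuracy reflect the missing component rather than a smaller overall weight.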
Figure 3: Downstream accuracy under varying perturbation sample counts 𝑁 and perturbation scales, expressed as fractions of the optimal value 𝜇 = 0.004, with all other hyperparameters held fixed. Results are reported for Qwen3-4B, where (∗) indicates the values used in the main experiments. The confidence component (𝐶̄) is clearly the primary driver of the method’s gains: removing it reduces the fraction of sample…
Original abstract

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs – available from a single forward pass without generating any tokens – provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes optimizing the continuous embeddings of a fixed few-shot prompt at test time to maximize a bounded self-supervised confidence proxy based on the model's log-probabilities assigned to the demonstrated outputs (available from a single forward pass). This zeroth-order optimization yields a calibration procedure requiring no finetuning, token generation, predefined label sets, or external data, applicable to both classification and free-form generation. Across ICL tasks the method is reported to match or improve base-model performance, outperform classification-specific baselines on most tasks, and exhibit a statistically significant correlation between proxy improvement and downstream accuracy gains.

Significance. If the central claim holds, the work supplies a lightweight, general test-time adaptation technique for ICL that relies solely on the model's internal signals and extends to generation tasks. The reported correlation supplies empirical grounding for the proxy as an optimization signal. No machine-checked proofs or parameter-free derivations are claimed, but the absence of external data or generation steps is a practical strength.

major comments (2)
  1. [Abstract] Abstract: the claim that the statistically significant correlation 'confirms that the proposed proxy encodes a reliable optimization signal' does not address whether maximizing the demonstration log-prob proxy changes conditional behavior on unseen test inputs or merely increases probability mass on the fixed demonstration tokens; this distinction is load-bearing for the generalization claim.
  2. [Method] Method (zeroth-order optimization description): because the objective is defined exclusively on the fixed demonstrations and never observes test inputs, the manuscript must supply analysis or controls showing that embedding adjustments alter predictions outside the demonstration set rather than overfitting the demonstrated outputs; the current correlation evidence alone leaves this open.
minor comments (1)
  1. Notation for the bounded proxy could be introduced with an explicit equation rather than prose description to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the generalization properties of our test-time optimization procedure. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the statistically significant correlation 'confirms that the proposed proxy encodes a reliable optimization signal' does not address whether maximizing the demonstration log-prob proxy changes conditional behavior on unseen test inputs or merely increases probability mass on the fixed demonstration tokens; this distinction is load-bearing for the generalization claim.

    Authors: We agree the abstract phrasing should be more precise on this point. All downstream accuracy results are measured on held-out test inputs never seen during optimization, and the reported correlation is specifically between proxy gains on the demonstrations and accuracy improvements on those unseen test examples. This already provides evidence that the embedding adjustments affect conditional behavior beyond the fixed demonstrations. We will revise the abstract to explicitly state that the correlation is with test-set accuracy gains, thereby underscoring the generalization aspect. revision: yes

  2. Referee: [Method] Method (zeroth-order optimization description): because the objective is defined exclusively on the fixed demonstrations and never observes test inputs, the manuscript must supply analysis or controls showing that embedding adjustments alter predictions outside the demonstration set rather than overfitting the demonstrated outputs; the current correlation evidence alone leaves this open.

    Authors: We concur that dedicated controls would strengthen the presentation. While the statistically significant correlation with test accuracy (measured on inputs outside the optimization set) already indicates that the adjustments influence predictions on unseen data, we will add a new analysis subsection. This will include quantitative comparisons of model output distributions on test inputs before versus after optimization, along with discussion of the bounded, small-magnitude nature of the zeroth-order updates to address potential overfitting concerns. revision: yes
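
One way such a before-versus-after comparison could be quantified is the Jensen-Shannon divergence between the model's output distributions on a test input. The sketch below is an assumption about what that analysis might look like, not the authors' planned metric. The function names are hypothetical, and the inputs are assumed to be probability vectors over the same vocabulary.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Symmetric divergence, 0 for identical inputs, at most ln 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

A divergence near zero on test inputs would suggest the optimized embeddings mostly changed probabilities on the demonstrations themselves. A clearly nonzero value would indicate that predictions on unseen inputs shifted as well, although it would not show whether they shifted in the right direction.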

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained against external test accuracy

Full rationale

The central procedure optimizes prompt embeddings to maximize a log-probability proxy computed solely on the fixed demonstration outputs. This proxy is explicitly defined from the model's forward pass on those demonstrations, but the claimed benefit is measured on independent test inputs whose labels are never observed during optimization. The reported statistically significant correlation between proxy improvement and downstream accuracy gain constitutes an external empirical check rather than a definitional reduction. No equations equate the optimized proxy directly to test accuracy by construction, no self-citation chains bear the load, and no uniqueness theorems or ansatzes are smuggled in. The method is therefore not forced to succeed; any observed lift on unseen data stands as a genuine empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the log-prob signal is informative for task inference and on the modeling choice to treat prompt embeddings as continuous optimizable variables; no explicit free parameters or new invented entities beyond the proxy itself are stated.

axioms (1)
  • domain assumption: Log-probabilities assigned to demonstrated outputs provide a meaningful signal for how well the model has inferred the task
    Stated as the key observation enabling the proxy; central to the optimization signal.
invented entities (1)
  • bounded self-supervised confidence proxy (no independent evidence)
    purpose: To serve as an optimizable signal derived from demonstration log-probabilities
    Formalized in the paper from the observed log-prob signal; no independent evidence outside the method itself.

pith-pipeline@v0.9.0 · 5702 in / 1354 out tokens · 40905 ms · 2026-05-25T04:53:48.050943+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 5 internal anchors

  1. [1]

    Rahul Atul Bhope, Praveen Venkateswaran, K. R. Jayaram, Vatche Isahagian, Vinod Muthusamy, and Nalini Venkatasubramanian. OptiSeq: Ordering examples on-the-fly for in-context learning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2486...

  2. [2]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  3. [3]

    ICLEval: Evaluating in-context learning ability of large language models

    Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, YanTao Jia, Zhao Cao, and Ji-Rong Wen. ICLEval: Evaluating in-context learning ability of large language models. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Lingui...

  4. [4]

    Token- based decision criteria are suboptimal in in-context learning

    Hakaze Cho, Yoshihiro Sakai, Mariko Kato, Kenshiro Tanaka, Akira Ishii, and Naoya Inoue. Token- based decision criteria are suboptimal in in-context learning. In Luis Chiruzzo, Alan Ritter, and 12 Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Tech...

  5. [5]

    Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers. In Anna Rogers, Jor- dan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019, Toronto, Canada, July 202...

  6. [7]

    Complexity-based prompting for multi-step reasoning, 2023

    Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning, 2023. URLhttps://arxiv.org/abs/2210.00720

  7. [8]

    Variance- reduced zeroth-order methods for fine-tuning language models

    Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, and Wooseok Ha. Variance- reduced zeroth-order methods for fine-tuning language models. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=VHO4nE7v41

  8. [9]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

  9. [10]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, 2023. URLhttps://arxiv.o...

  10. [11]

    What makes a good order of examples in in-context learning

    Qi Guo, Leiyu Wang, Yidong Wang, Wei Ye, and Shikun Zhang. What makes a good order of examples in in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 14892–14904, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/...

  11. [12]

    Prototypical calibration for few-shot learning of language models

    Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, and Furu Wei. Prototypical calibration for few-shot learning of language models. InThe Eleventh International Conference on Learning Representations,

  12. [13]

    URLhttps://openreview.net/forum?id=nUsP9lFADUF

  13. [14]

    In-context learning creates task vectors

    Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9318–9333, Singapore, December 2023. Association for Computational Linguis- tics. doi: 10.18653/v1/2023.findings-emnlp.624. URL https://aclan...

  14. [15]

    Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

  15. [16]

    Surface form competition: Why the highest probability answer isn’t always right

    Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, Online and...

  16. [17]

    Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator, 2022

    Hyuhng Joon Kim, Hyunsoo Cho, Junyeob Kim, Taeuk Kim, Kang Min Yoo, and Sang goo Lee. Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator, 2022. URLhttps://arxiv.org/abs/2206.08082. 15

  17. [18]

    Answer-level calibration for free-form multiple choice question answering

    Sawan Kumar. Answer-level calibration for free-form multiple choice question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 665–679, Dublin, Ireland, May 2022. Association for Computational Linguistics. do...

  18. [19]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  19. [20]

    Diverse demonstrations improve in-context compositional generalization

    Itay Levy, Ben Bogin, and Jonathan Berant. Diverse demonstrations improve in-context compositional generalization. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1401– 1422, Toronto, Canada, July 2023. Association for Com...

  20. [21]

    Finding support examples for in-context learning

    Xiaonan Li and Xipeng Qiu. Finding support examples for in-context learning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6219–6235, Singapore, December 2023. Association for Computational Linguistics. doi: 10. 18653/v1/2023.findings-emnlp.411. URL https://aclanthology.o...

  21. [22]

    Task calibration: Calibrating large language models on inference tasks

    Yingjie Li, Yun Luo, Xiaotian Xie, and Yue Zhang. Task calibration: Calibrating large language models on inference tasks. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 6937–6951, Vienna, Austria, July 2025. Association for Computational Lin...

  22. [23]

    𝑠𝑒2: Sequential example selection for in-context learning

    Haoyu Liu, Jianfeng Liu, Shaohan Huang, Yuefeng Zhan, Hao Sun, Weiwei Deng, Furu Wei, and Qi Zhang. 𝑠𝑒2: Sequential example selection for in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 5262–5284, Bangkok, Thailand, August 2024. Association for Comput...

  23. [24]

    Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Eneko Agirre, Marianna Apidianaki, and Ivan Vulić, editors, Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, D...

  24. [25]

    doi: 10.18653/v1/2022.deelio-1.10

    Association for Computational Linguistics. doi: 10.18653/v1/2022.deelio-1.10. URL https: //aclanthology.org/2022.deelio-1.10/

  25. [26]

    Sheng Liu, Haotian Ye, Lei Xing, and James Y. Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine L...

  26. [27]

    Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning

    Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse meZO: Less parameters for better performance in zeroth-order LLM fine-tuning. InThe Thirty-ninth Annual 16 Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum? id=Tjw0ACu3NL

  27. [28]

    Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)...

  28. [29]

    Z-ICL: Zero-shot in-context learning with pseudo-demonstrations

    Xinxi Lyu, Sewon Min, Iz Beltagy, Luke Zettlemoyer, and Hannaneh Hajishirzi. Z-ICL: Zero-shot in-context learning with pseudo-demonstrations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2304–2317, Toronto, Canada, July...

  29. [30]

    Lee, Danqi Chen, and Sanjeev Arora

    Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=Vota6rFhBQ

  30. [31]

    Noisy channel language model prompting for few-shot text classification

    Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5316–5330, Dublin, Ireland, Ma...

  31. [32]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhab...

  32. [33]

    Random gradient-free minimization of convex functions

    Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017. doi: 10.1007/s10208-015-9296-2. URL https://doi.org/10.1007/s10208-015-9296-2

  33. [34]

    Revisiting demonstration selection strategies in in-context learning

    Keqin Peng, Liang Ding, Yancheng Yuan, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. Revisiting demonstration selection strategies in in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9090–9101, Bangk...

  34. [35]

    doi: 10.18653/v1/2024.acl-long.492

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.492. URL https: //aclanthology.org/2024.acl-long.492/

  35. [36]

    Rapid selection and ordering of in- context demonstrations via prompt embedding clustering

    Kha Pham, Hung Le, Man Ngo, and Truyen Tran. Rapid selection and ordering of in- context demonstrations via prompt embedding clustering. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations, volume 2025, pages 43540–43556, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/ 6c2745a8e20...

  36. [38]

    Language models are unsupervised multitask learners.OpenAI Blog, 2019

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.OpenAI Blog, 2019

  37. [39]

    Test-time detoxification without training or learning anything, 2026

    Baturay Saglam and Dionysis Kalogerias. Test-time detoxification without training or learning anything, 2026. URLhttps://arxiv.org/abs/2602.02498

  38. [40]

    Test-Time Safety Alignment

    Baturay Saglam and Dionysis Kalogerias. Test-time safety alignment, 2026. URL https://arxiv. org/abs/2604.26167

  39. [41]

    Learning task representations from in-context learning

    Baturay Saglam, Xinyang Hu, Zhuoran Yang, Dionysis Kalogerias, and Amin Karbasi. Learning task representations from in-context learning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 6634–6663, Vienna, Austria, July 2025. Association for Co...

  40. [42]

    Smith, and Tao Yu

    Hongjin SU, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. Selective annotation makes language models better few-shot learners. InThe Eleventh International Conference on Learning Representations,

  41. [43]

    URLhttps://openreview.net/forum?id=qY1hlv7gwg

  42. [44]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupati- raju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin...

  43. [45]

    Li, Arnab Sen Sharma, Aaron Mueller, Byron C

    Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=AwyxtyMwaG. arXiv:2310.15213

  44. [46]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL htt...

  45. [47]

    Transformers learn in-context by gradient descent

    Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, Joao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machi...

  46. [48]

    Better zero-shot reasoning with self-adaptive prompting

    Xingchen Wan, Ruoxi Sun, Hanjun Dai, Sercan Arik, and Tomas Pfister. Better zero-shot reasoning with self-adaptive prompting. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 3493–3514, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.1...

  47. [49]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen

    Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Eisenschlos, Sercan Arik, and Tomas Pfister. Universal self-adaptive prompting. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7437– 7462, Singapore, December 2023. Association for Computational ...

  48. [50]

    Label words are anchors: An information flow perspective for understanding in-context learning

    Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore, Dece...

  49. [51]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, 19 Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-...

  50. [52]

    Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering

    Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

  51. [53]

    $k$NN prompting: Beyond-context learning with calibration-free nearest neighbor inference

    Benfeng Xu, Quan Wang, Zhendong Mao, Yajuan Lyu, Qiaoqiao She, and Yongdong Zhang. $k$NN prompting: Beyond-context learning with calibration-free nearest neighbor inference. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=fe2S7736sNS

  52. [54]

    Misconfidence-based demonstration selection for LLM in-context learning

    Shangqing Xu and Chao Zhang. Misconfidence-based demonstration selection for LLM in-context learning, 2024. URL https://arxiv.org/abs/2401.06301

  53. [55]

    In-context example ordering guided by label distributions

    Zhichao Xu, Daniel Cohen, Bei Wang, and Vivek Srikumar. In-context example ordering guided by label distributions. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2623–2640, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-naacl.167. URL https://aclanthology.org/2024.findings-naacl.167/

  55. [57]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  56. [58]

    Representative demonstration selection for in-context learning with two-stage determinantal point process

    Zhao Yang, Yuanzhe Zhang, Dianbo Sui, Cao Liu, Jun Zhao, and Kang Liu. Representative demonstration selection for in-context learning with two-stage determinantal point process. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5443–5456, Singapore, Decembe...

  57. [59]

    Ground-truth labels matter: A deeper look into input-label demonstrations

    Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, and Taeuk Kim. Ground-truth labels matter: A deeper look into input-label demonstrations. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2422–2437, Abu D...

  58. [60]

    Unlocking black-box prompt tuning efficiency via zeroth-order optimization

    Heshen Zhan, Congliang Chen, Tian Ding, Ziniu Li, and Ruoyu Sun. Unlocking black-box prompt tuning efficiency via zeroth-order optimization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14825–14838, Miami, Florida, USA, November 2024. Association for Computation...

  59. [61]

    Batch-ICL: Effective, efficient, and order-agnostic in-context learning

    Kaiyi Zhang, Ang Lv, Yuhan Chen, Hansen Ha, Tao Xu, and Rui Yan. Batch-ICL: Effective, efficient, and order-agnostic in-context learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 10728–10739, Bangkok, Thailand, August 2024. Association for Computational Linguistic...

  60. [62]

    DPZero: private fine-tuning of language models without backpropagation

    Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil, Sewoong Oh, and Niao He. DPZero: private fine-tuning of language models without backpropagation. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  61. [63]

    D.Va: Validate your demonstration first before you use it

    Qi Zhang, Zhiqing Xiao, Ruixuan Xiao, Lirong Gao, and Junbo Zhao. D.Va: Validate your demonstration first before you use it. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2580–2594, Vienna, Austri...

  62. [64]

    COME: Test-time adaption by conservatively minimizing entropy

    Qingyang Zhang, Yatao Bian, Xinke Kong, Peilin Zhao, and Changqing Zhang. COME: Test-time adaption by conservatively minimizing entropy. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=506BjJ1ziZ

  63. [65]

    Calibrate before use: Improving few-shot performance of language models

    Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR, 18–24 Jul 2021. URL https://p...

  64. [66]

    Large language models are not robust multiple choice selectors

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=shr9PXz7T0

  66. [68]

    Batch calibration: Rethinking calibration for in-context learning and prompt engineering

    Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine A Heller, and Subhrajit Roy. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=L3FHMoKZcS.

    A Applicability of Existing Test-Time Met...