Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

Bing Qin; Chaofen Yang; Haochun Wang; Jiatong Liu; Jingbo Wang; Sendong Zhao; Ting Liu; Zewen Qiang

arxiv: 2605.18512 · v1 · pith:475STBQKnew · submitted 2026-05-18 · 💻 cs.CL

Haochun Wang, Chaofen Yang, Jiatong Liu, Jingbo Wang, Zewen Qiang, Sendong Zhao, Bing Qin, Ting Liu

Pith reviewed 2026-05-20 11:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords: in-context learning · demonstration selection · query difficulty · sample-and-judge · LLM prompting · classification · stop-on-acceptance

The pith

Judging whether a query and demonstration set will succeed is cheaper than searching for the optimal set in in-context learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that demonstration selection for in-context learning is easier to judge than to find: predicting whether a specific query-context pair will succeed costs less than exhaustively searching over demonstration combinations. DiSP implements this by running random trials on training queries to measure success rates, training a router to predict query difficulty, and training a separate judge for each difficulty level. At test time the system samples contexts and judges them one at a time until a judge accepts one or the budget runs out. The paper reports that this yields higher accuracy than learned selection baselines while cutting runtime substantially. A sympathetic reader would care because this reframes prompt engineering as a prediction task rather than an optimization task, making reliable in-context learning more practical for classification.

Core claim

DiSP is a sample-and-judge framework that stratifies queries by difficulty. It runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found.
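The first step described here, estimating a per-query success rate from random demonstration trials, can be sketched as follows. This is a minimal illustration and not the authors' implementation: `estimate_success_rate`, `difficulty_level`, `toy_predict`, the three level names, and the 0.3/0.7 cut-offs are all hypothetical, and `toy_predict` merely stands in for a call to the target LLM.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, label)


def estimate_success_rate(
    query: str,
    gold_label: str,
    pool: Sequence[Demo],
    predict: Callable[[Sequence[Demo], str], str],
    k: int = 4,
    trials: int = 8,
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of random k-shot contexts under which `predict` returns the gold label."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        # random.sample draws k distinct pool entries in random order.
        context = rng.sample(list(pool), k)
        if predict(context, query) == gold_label:
            hits += 1
    return hits / trials


def difficulty_level(rate: float, low: float = 0.3, high: float = 0.7) -> str:
    """Bucket an empirical success rate. The cut-offs are illustrative assumptions."""
    if rate >= high:
        return "easy"
    if rate >= low:
        return "medium"
    return "hard"


def toy_predict(context: Sequence[Demo], query: str) -> str:
    """Stand-in for the target LLM: copies the label of the last demonstration."""
    return context[-1][1]


pool = [("great film", "pos"), ("loved it", "pos"), ("superb", "pos"),
        ("dull plot", "neg"), ("fine acting", "pos"), ("a delight", "pos")]
rate = estimate_success_rate("what a joy", "pos", pool, toy_predict, k=4, trials=20)
print(rate, difficulty_level(rate))
```

The resulting (query, level) pairs would then serve as supervised labels for the router, and the per-trial (query, context, success) triples as labels for the level-specific judges.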

What carries the argument

The sample-and-judge framework that uses success-rate estimates from random trials to train a difficulty router and level-specific judges, then applies stop-on-acceptance selection at inference.
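A minimal sketch of stop-on-acceptance selection under a budget is given below, assuming a router that maps a query to a level name and one judge per level that returns an estimated success probability. The names `sample_and_judge`, `Selection`, the acceptance `threshold`, and the risk-tag string are hypothetical. The source does not say what DiSP does with the prompt when nothing is accepted, so the sketch simply returns no context together with a tag.

```python
import random
from typing import Callable, Dict, NamedTuple, Optional, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, label)
Judge = Callable[[str, Sequence[Demo]], float]  # estimated probability of success


class Selection(NamedTuple):
    context: Optional[Tuple[Demo, ...]]  # None when nothing was accepted
    level: str                           # difficulty level chosen by the router
    judged: int                          # number of contexts scored
    risk_tag: Optional[str]              # set only when the budget is exhausted


def sample_and_judge(
    query: str,
    pool: Sequence[Demo],
    route: Callable[[str], str],
    judges: Dict[str, Judge],
    k: int = 4,
    budget: int = 8,
    threshold: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Selection:
    rng = rng or random.Random(0)
    level = route(query)
    judge = judges[level]
    for judged in range(1, budget + 1):
        context = tuple(rng.sample(list(pool), k))
        # Stop at the first context the level-specific judge accepts.
        if judge(query, context) >= threshold:
            return Selection(context, level, judged, None)
    # Budget exhausted: report a diagnostic tag instead of a context.
    return Selection(None, level, budget, "no-accepted-context:" + level)


pool = [("great film", "pos"), ("dull plot", "neg"), ("superb", "pos"),
        ("boring", "neg"), ("a delight", "pos")]
judges = {"easy": lambda q, d: 0.9, "hard": lambda q, d: 0.1}  # toy judges
easy = sample_and_judge("short", pool, lambda q: "easy", judges)
hard = sample_and_judge("short", pool, lambda q: "hard", judges)
print(easy.judged, easy.risk_tag)
print(hard.judged, hard.risk_tag)
```

Because the loop returns on the first acceptance, the number of judge calls per query varies between 1 and `budget`, which is where both the speedup and the variable latency come from.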

If this is right

DiSP achieves the best average accuracy across five classification datasets with Llama 3-8B and Qwen 2.5-7B.
It improves over strong learned selection baselines by up to 3.4%.
It achieves up to 23× end-to-end wall-clock speedup.
It emits diagnostic risk tags when no suitable context is found within budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trial-and-judge pattern could be applied to non-classification tasks such as reasoning or generation by redefining success.
Variable per-query latency from stop-on-acceptance could be mitigated by caching common difficulty levels in production systems.
If random trials under-sample rare query types, the router might systematically misclassify those cases and reduce overall reliability.

Load-bearing premise

Success rates measured from random demonstration trials on training queries are sufficiently predictive to train a generalizable difficulty router and level-specific judges that transfer to unseen test queries under the stop-on-acceptance policy.

What would settle it

If the trained router and judges produce no accuracy gain or speedup on held-out test queries compared with random selection, the claim that trial-based difficulty prediction generalizes would be falsified.

Figures

Figures reproduced from arXiv: 2605.18512 by Bing Qin, Chaofen Yang, Haochun Wang, Jiatong Liu, Jingbo Wang, Sendong Zhao, Ting Liu, Zewen Qiang.

**Figure 1.** Finding vs. judging for demonstration selection. Searching for an optimal $D^\star$ faces a combinatorial space, while judging enables efficient sample-and-test with stop-on-acceptance under an explicit budget. Given a candidate pool of size $N$, selecting $k$ demonstrations and ordering them yields $\binom{N}{k}\,k!$ possible demonstrations. Brute-force evaluation is typically infeasible because each candida…

**Figure 2.** Overview of DiSP. Stage 1: run the target LLM on each training query under multiple random k-shot contexts to label success and estimate an empirical success rate for difficulty stratification. Stage 2: train a router and level-specific judges to predict success for a given (q, D) pair. Stage 3: at test time, route each query and apply stop-on-acceptance sample-and-judge over sampled contexts up to a budge…

**Figure 3.** Hidden-state probes provide evidence that success and failure form separable clusters in the representation space (LLaMA3-8B on TREC). We report AUROC/AUPRC in…

**Figure 4.** Last-layer MLP probe ROC/PR curves for success prediction on TREC.

**Figure 5.** Last-layer MLP probe ROC/PR curves for success prediction on SST-2. Panel (a), LLaMA3-8B (plot title SST2_LLAMA3_8B, accuracy 0.9348). ROC (false positive rate vs. true positive rate), AUROC: correct 0.7996, incorrect 0.8516, macro 0.8264, micro 0.9763. PR (recall vs. precision), AUPRC: correct 0.9766, incorrect 0.3204, micro 0.9686.

**Figure 6.** Last-layer MLP probe ROC/PR curves for success prediction on SST-5. Panel (a), LLaMA3-8B (plot title SST5_LLAMA3_8B, accuracy 0.5954). AUROC: correct 0.6249, incorrect 0.6250, macro 0.6253, micro 0.6238. AUPRC: correct 0.5855, incorrect 0.6381, micro 0.6088.

**Figure 7.** Last-layer MLP probe ROC/PR curves for success prediction on AGNEWS. Panel (a), LLaMA3-8B (plot title AGNEWS_LLAMA3_8B, accuracy 0.8901). AUROC: correct 0.8634, incorrect 0.8803, macro 0.8723, micro 0.9443. AUPRC: correct 0.9511, incorrect 0.6913, micro 0.9311.

**Figure 8.** Last-layer MLP probe ROC/PR curves for success prediction on MNLI. Panel (a), LLaMA3-8B (plot title MNLI_LLAMA3_8B, accuracy 0.7848). AUROC: correct 0.8398, incorrect 0.8500, macro 0.8452, micro 0.8669. AUPRC: correct 0.8907, incorrect 0.7440, micro 0.8538.
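The probe figures above are summarized by AUROC. As a reminder of what that number measures for success prediction, here is a small sketch that computes AUROC directly from its pairwise definition (the probability that a randomly chosen successful trial is scored above a randomly chosen failed one, with ties counted as one half). The function name and the toy scores are hypothetical, and the paper's own evaluation code is not shown in the source.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise (Mann-Whitney) AUROC; labels are 1 for success and 0 for failure."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy probe scores: three of the four (success, failure) pairs are ranked correctly.
print(auroc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```

The quadratic loop is fine for illustration. For large evaluation sets a rank-based formulation gives the same value in O(n log n).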

Original abstract

In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because the space of possible demonstration contexts and combinations is enormous. We argue that demonstration selection is easier to judge than to find: predicting whether a specific query-context pair $(q,D)$ will succeed is cheaper and more general than searching for an optimal $D^\star$. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama 3-8B and Qwen 2.5-7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, while achieving up to 23× end-to-end wall-clock speedup.
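To put a number on how large the space of demonstration contexts is: choosing k demonstrations from a pool of N and ordering them gives C(N, k) · k! distinct contexts. The short sketch below evaluates this for an illustrative pool; the values N = 1000, k = 4 and the budget of 8 are assumptions chosen for the example, not settings taken from the paper.

```python
import math


def ordered_contexts(pool_size: int, k: int) -> int:
    """Number of ordered k-shot contexts: C(N, k) * k!, which equals N! / (N - k)!."""
    return math.comb(pool_size, k) * math.factorial(k)


N, K, BUDGET = 1000, 4, 8  # illustrative values only
total = ordered_contexts(N, K)
print(f"{total:.3e} ordered contexts vs. at most {BUDGET} judged per query")
```

With these values the exhaustive space is on the order of 10^12 contexts, while a sample-and-judge loop scores at most the budgeted number per query.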

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

DiSP shifts demonstration selection from search to judgment with a difficulty router and budgeted judges, delivering reported accuracy gains and large speedups, though transfer from training-query labels to test queries remains the key open question.

DiSP reframes demonstration selection for in-context learning: judging whether a given context will work is easier than searching for the best one. The authors estimate success rates with random trials on training queries, train a router to assign difficulty levels, and use level-specific judges that stop as soon as they accept a context. On five classification datasets with Llama 3-8B and Qwen 2.5-7B, the paper reports the highest average accuracy and up to 23× faster runtime than strong selection baselines.

The distinct part is the explicit difficulty stratification plus the budgeted stop-on-acceptance policy. It makes selection more predictable and adds risk tags for cases where no suitable context is found within budget. The paper does well by grounding the difficulty labels in independently run random trials rather than deriving them from the model being optimized. The reported accuracy edge and large speedups are the kind of numbers that matter for practical use.

The soft spot is the transfer assumption. Success rates measured on the training split may not produce a router and judges that generalize to test queries, especially if difficulty is not stable across distributions. Without variance numbers or more ablations, the central claim remains provisional.

This is for practitioners scaling in-context learning where selection cost is an issue. A reader who cares about wall-clock efficiency and reliable prompts will get concrete ideas from it. The paper engages directly with the literature on ICL selection and presents a workable alternative. It deserves a serious referee. I would recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that demonstration selection for in-context learning is easier to judge than to find. It introduces DiSP, a sample-and-judge framework that estimates per-query success rates via random demonstration trials on the training split, trains a lightweight router to assign difficulty levels from query features, and trains level-specific judges. At inference, DiSP applies stop-on-acceptance judging under an explicit budget and emits risk tags when no suitable context is accepted. On five classification datasets using Llama 3-8B and Qwen 2.5-7B, DiSP reports the highest average accuracy (an improvement of up to 3.4% over strong learned baselines) together with up to 23× end-to-end wall-clock speedup.

Significance. If the reported generalization of the difficulty router and level-specific judges holds, the work supplies a practical, budget-aware alternative to exhaustive or learned demonstration search. The concrete accuracy and speedup numbers, together with the diagnostic risk tags, would constitute a useful engineering contribution for reliable ICL deployment. The approach also supplies an explicit, falsifiable test of whether query difficulty is sufficiently stable to be predicted from surface features alone.

major comments (2)

[§4 (Experiments) and §3.2 (Difficulty Router)] The headline accuracy and speedup claims rest on the transfer of success-rate labels obtained from random trials on training queries to unseen test queries. The manuscript should report the correlation between router-predicted difficulty and observed success rates on the test split, as well as an ablation that measures performance drop when the router is trained on a held-out portion of the training queries. Without these diagnostics, it remains unclear whether the stratification is capturing an intrinsic query property or merely fitting the training-query sampling distribution.
[§3.3 (Inference Procedure) and Table 2] The stop-on-acceptance policy with a fixed acceptance budget is central to both the speedup and the risk-tag mechanism. The paper should include a sensitivity analysis showing how accuracy and wall-clock time vary with different budget values (e.g., 1, 5, 10) and whether the reported 23× speedup remains stable when the budget is chosen to match the computational cost of the strongest baseline.
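The first diagnostic requested above, a correlation between router-predicted difficulty and observed test-split success rates, could be computed with a rank correlation along these lines. This is a dependency-free sketch: the function names, the encoding of levels as 0 (easy), 1 (medium), 2 (hard), and the toy numbers are assumptions for illustration and are not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


predicted_level = [0, 0, 1, 2, 2]            # hypothetical router outputs
observed_rate = [0.9, 0.8, 0.5, 0.2, 0.1]    # hypothetical test success rates
print(spearman(predicted_level, observed_rate))
```

A strongly negative value would indicate that queries routed as harder do succeed less often on the test split. A value near zero would suggest the stratification does not transfer.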

minor comments (2)

[Abstract and §4.1] The abstract states 'improving over strong learned selection baselines by up to 3.4%'; the main text should clarify whether this is absolute accuracy or relative improvement and report per-dataset deltas with standard deviations.
[§3.2] Notation for the router input features and the judge scoring function is introduced without an explicit equation; adding a compact mathematical definition would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the requested analyses.

Referee: [§4 (Experiments) and §3.2 (Difficulty Router)] The headline accuracy and speedup claims rest on the transfer of success-rate labels obtained from random trials on training queries to unseen test queries. The manuscript should report the correlation between router-predicted difficulty and observed success rates on the test split, as well as an ablation that measures performance drop when the router is trained on a held-out portion of the training queries. Without these diagnostics, it remains unclear whether the stratification is capturing an intrinsic query property or merely fitting the training-query sampling distribution.

Authors: We agree that direct evidence of generalization from training-query success rates to test queries strengthens the central claim. In the revised manuscript we add (i) the correlation between router-predicted difficulty and empirical success rates measured on the held-out test split and (ii) an ablation in which the router is trained on a random 80% subset of the training queries and evaluated on the remaining 20%. Both results are reported in a new paragraph of §4.2 together with the corresponding figures; they indicate that the stratification captures stable query properties rather than merely memorizing the training sampling distribution. revision: yes

Referee: [§3.3 (Inference Procedure) and Table 2] The stop-on-acceptance policy with a fixed acceptance budget is central to both the speedup and the risk-tag mechanism. The paper should include a sensitivity analysis showing how accuracy and wall-clock time vary with different budget values (e.g., 1, 5, 10) and whether the reported 23× speedup remains stable when the budget is chosen to match the computational cost of the strongest baseline.

Authors: We appreciate the request for a sensitivity study of the acceptance budget. The revised version includes a new sensitivity table (expanded Table 2) and an accompanying paragraph in §3.3 that reports accuracy and wall-clock time for budgets of 1, 5 and 10. We also evaluate the end-to-end speedup when the budget is set to match the average inference cost of the strongest learned baseline; the 23× figure remains stable under this cost-matched regime. The updated text explicitly discusses the trade-off between accuracy, latency and risk-tag frequency. revision: yes
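The trade-off between latency and risk-tag frequency mentioned in this response can be made concrete with a back-of-the-envelope model. The derivation below is a sketch under a simplifying assumption that is not taken from the paper: for a fixed query, each sampled context is accepted independently with the same probability p, and B is the acceptance budget. T denotes the number of contexts judged before stopping.

```latex
% Assumption (not from the paper): independent acceptances with probability p,
% acceptance budget B, T = number of contexts judged for one query.
\begin{align}
\Pr[\text{risk tag}] &= (1-p)^{B}, \\
\mathbb{E}[T] &= \sum_{t=1}^{B} t\,p\,(1-p)^{t-1} + B\,(1-p)^{B}
              = \sum_{t=0}^{B-1} (1-p)^{t}
              = \frac{1-(1-p)^{B}}{p}.
\end{align}
% Example: p = 0.5, B = 5 gives Pr[risk tag] = 1/32 \approx 0.031
% and E[T] = (1 - 1/32) / 0.5 = 1.9375 judged contexts.
```

Under this model, raising the budget shrinks the risk-tag rate geometrically while the expected number of judge calls saturates at 1/p, so most of the added cost falls on hard queries with small p. Real acceptances are unlikely to be fully independent, since contexts come from one pool and are scored by one judge, so the formulas should be read as a rough guide only.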

Circularity Check

0 steps flagged

No significant circularity; empirical labels from independent trials support standard supervised training

full rationale

The paper estimates per-query success rates directly from random demonstration trials on the training split, then uses those observed rates to label difficulty levels for training a router and level-specific judges. This is a conventional supervised pipeline: expensive sampling produces independent labels, a lightweight model is fit to predict difficulty from query features alone, and inference applies the trained components under a stop-on-acceptance policy. No equation or step defines success via the router itself, renames a fitted parameter as a prediction, or reduces the central claim to a self-citation chain. The reported accuracy and speedup results are measured on held-out test queries, making the derivation self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the empirical observability of ICL success via sampling and the assumption that lightweight predictors trained on those samples generalize; no new physical entities are introduced.

free parameters (2)

number of random demonstration trials per training query
Used to estimate per-query success rates for router and judge training.
acceptance budget at inference
Explicit limit on number of judged contexts before emitting a risk tag.

axioms (2)

domain assumption: In-context learning performance is sensitive to demonstration choice and can be estimated from finite random trials.
Foundational premise stated in the problem setup.
domain assumption: A lightweight model can learn to predict query difficulty and demonstration quality from sampled success data.
Core modeling assumption enabling the router and judges.

pith-pipeline@v0.9.0 · 5752 in / 1449 out tokens · 67413 ms · 2026-05-20T11:07:33.408539+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

DiSP runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901

[2] [2]

UPRISE : Universal prompt retrieval for improving zero-shot evaluation

Cheng, D., Huang, S., Bi, J., Zhan, Y., Liu, J., Wang, Y., Sun, H., Wei, F., Deng, W., and Zhang, Q. UPRISE : Universal prompt retrieval for improving zero-shot evaluation. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 12318--12337, Singapore, December 2023. Asso...

work page doi:10.18653/v1/2023.emnlp-main.758 2023

[3] [3]

In-context demonstration selection with cross entropy difference

Iter, D., Pryzant, R., Xu, R., Wang, S., Liu, Y., Xu, Y., and Zhu, C. In-context demonstration selection with cross entropy difference. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 1150--1162, 2023

work page 2023

[4] [4]

Learning to rank for in-context example retrieval

Ji, Y., Zhang, L., Ambyerhan, Que, H., Shi, L., Chao, W., and Zhang, Y. Learning to rank for in-context example retrieval. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=WyQ20adbUb

work page 2025

[5] [5]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

and Roth, D

Li, X. and Roth, D. Learning question classifiers: The role of semantic information. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), 2002

work page 2002

[7] [7]

𝑠𝑒2: Sequential example selection for in-context learning

Liu, H., Liu, J., Huang, S., Zhan, Y., Sun, H., Deng, W., Wei, F., and Zhang, Q. se^2 : Sequential example selection for in-context learning. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 5262--5284, Bangkok, Thailand, August 2024. Association for Computational Linguistics. do...

work page doi:10.18653/v1/2024.findings-acl.312 2024

[8] [8]

Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., and Chen, W. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd workshop on knowledge extraction and integration for deep learning architectures, pp.\ 100--114, 2022

work page 2022

[9] [9]

Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity

Lu, Y., Bartolo, M., Moore, A., Riedel, S., and Stenetorp, P. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 8086--8098, 2022

work page 2022

[10] [10]

The Llama 3 Herd of Models , 2024

Meta AI . The Llama 3 Herd of Models , 2024. URL https://llama.meta.com/llama3/

work page 2024

[11] [11]

Iterative amortized inference: Unifying in-context learning and learned optimizers

Mittal, S., Mahajan, D., Lajoie, G., and Pezeshki, M. Iterative amortized inference: Unifying in-context learning and learned optimizers. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

work page

[12] [12]

and Wong, E

Nguyen, T. and Wong, E. In-context example selection with influences. arXiv preprint arXiv:2302.11042, 2023

work page arXiv 2023

[13] [13]

In-context Learning and Induction Heads

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14]

Opsahl-Ong, K., Ryan, M., Purtell, J., Broman, D., Potts, C., Zaharia, M., and Khattab, O. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 9340–9366, 2024.

[15]

Qwen Team. Qwen2.5: A Family of Large Language Models, 2024. URL https://qwenlm.github.io/blog/qwen2.5/

[16]

Rubin, O., Herzig, J., and Berant, J. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2655–2671, 2022.

[17]

Shen, L., Mishra, A., and Khashabi, D. Position: Do pretrained transformers learn in-context by gradient descent. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 44712–44740, 2024.

[18]

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, 2013.

[19]

Vishwakarma, H., Mishler, A., Cook, T., Dalmasso, N., Raman, N., and Ganesh, S. Prune'n predict: Optimizing LLM decision-making with conformal prediction. In Forty-second International Conference on Machine Learning, 2025.

[20]

Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151–35174. PMLR, 2023.

[21]

Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1112–1122, 2018.

[22]

Xie, S. M., Raghunathan, A., Liang, P., and Ma, T. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations, 2022.

[23]

Xu, B. and Lu, Y. TECP: Token-entropy conformal prediction for LLMs. Mathematics, 13(20):3351, 2025.

[24]

Zhang, K., Lv, A., Chen, Y., Ha, H., Xu, T., and Yan, R. Batch-ICL: Effective, efficient, and order-agnostic in-context learning. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 10728–10739, 2024.

[25]

Zhang, X., Zhao, J., and LeCun, Y. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, 2015.

[26]

Zhang, Y., Feng, S., and Tan, C. Active example selection for in-context learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 9134–9148, 2022.

[27]

Zhang, Z., Lan, S., Song, L., Bian, J., Li, Y., and Ren, K. Learning to select in-context demonstration preferred by large language model. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 11345–11360, 2025. URL https://aclanthology.org/2025.findings-acl.592/

[28]

Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., and Ba, J. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022.