Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection
Pith reviewed 2026-05-20 11:07 UTC · model grok-4.3
The pith
Judging whether a query and demonstration set will succeed is cheaper than searching for the optimal set in in-context learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiSP is a sample-and-judge framework that stratifies queries by difficulty. It runs random demonstration trials to estimate the success rate of each training query, trains a lightweight router to predict difficulty from the query alone, and trains level-specific judges that score sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget and emits diagnostic risk tags when no suitable context is found.
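The labelling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `estimate_success_rate`, `stratify`, the `run_icl` callback, the toy `majority_label_model`, the trial count, and the level thresholds are all hypothetical names and values chosen for the example.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)


def majority_label_model(text: str, demos: List[Example]) -> str:
    """Toy stand-in for an LLM call: predicts the most frequent demonstration label."""
    labels = [label for _, label in demos]
    return max(sorted(set(labels)), key=labels.count)


def estimate_success_rate(
    query: Example,
    pool: Sequence[Example],
    run_icl: Callable[[str, List[Example]], str],
    k: int = 4,            # demonstrations per context (assumed value)
    n_trials: int = 16,    # random trials per query (assumed value)
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of random k-shot contexts for which the model answers correctly."""
    rng = rng or random.Random(0)
    text, gold = query
    hits = 0
    for _ in range(n_trials):
        demos = rng.sample(list(pool), k)
        if run_icl(text, demos) == gold:
            hits += 1
    return hits / n_trials


def stratify(rate: float, thresholds: Sequence[float] = (0.33, 0.66)) -> int:
    """Map a success rate to a difficulty level: 0 is hardest, len(thresholds) is easiest."""
    return sum(rate >= t for t in thresholds)
```

The resulting levels would serve as training labels for a router that sees only the query; the thresholds here are placeholders, since the source does not state how levels are cut.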
What carries the argument
A sample-and-judge framework: success-rate estimates from random trials are used to train a difficulty router and level-specific judges, and stop-on-acceptance selection is applied at inference.
If this is right
- DiSP achieves the best average accuracy across five classification datasets with Llama 3-8B and Qwen 2.5-7B.
- It improves over strong learned selection baselines by up to 3.4%.
- It achieves up to 23× end-to-end wall-clock speedup.
- It emits diagnostic risk tags when no suitable context is found within budget.
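The inference policy behind the speedup and the risk tags can be sketched as a short loop. This is a hedged illustration only: `stop_on_acceptance`, `Selection`, the acceptance `threshold`, and the tag string are hypothetical, and `judge` stands for whichever level-specific judge the router has selected for the query.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)


@dataclass
class Selection:
    demos: Optional[List[Example]]  # accepted context, or None if the budget ran out
    judge_calls: int                # number of candidates scored before stopping
    risk_tag: Optional[str]         # diagnostic tag when nothing was accepted


def stop_on_acceptance(
    query: str,
    pool: Sequence[Example],
    judge: Callable[[str, List[Example]], float],
    k: int = 4,              # demonstrations per candidate context (assumed)
    budget: int = 5,         # maximum number of judge calls (assumed)
    threshold: float = 0.5,  # acceptance threshold on the judge score (assumed)
    rng: Optional[random.Random] = None,
) -> Selection:
    """Sample random contexts and return the first one the judge accepts."""
    rng = rng or random.Random(0)
    for calls in range(1, budget + 1):
        candidate = rng.sample(list(pool), k)
        if judge(query, candidate) >= threshold:
            return Selection(candidate, calls, None)
    # Budget exhausted: hand back a tag instead of silently using a poor context.
    return Selection(None, budget, "no_context_accepted_within_budget")
```

Because the loop exits on the first acceptance, easy queries cost a single judge call, which is where a policy of this shape would save time relative to scoring a full candidate set.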
Where Pith is reading between the lines
- The same trial-and-judge pattern could be applied to non-classification tasks such as reasoning or generation by redefining success.
- Variable per-query latency from stop-on-acceptance could be mitigated by caching common difficulty levels in production systems.
- If random trials under-sample rare query types, the router might systematically misclassify those cases and reduce overall reliability.
Load-bearing premise
Success rates measured from random demonstration trials on training queries are sufficiently predictive to train a generalizable difficulty router and level-specific judges that transfer to unseen test queries under the stop-on-acceptance policy.
What would settle it
If the trained router and judges produce no accuracy gain or speedup on held-out test queries compared with random selection, the claim that trial-based difficulty prediction generalizes would be falsified.
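One possible way to run that comparison is a paired bootstrap over per-query correctness on the held-out split. The sketch below is an assumption about how such a check could be done, not a procedure taken from the paper; `paired_bootstrap_diff` and its defaults are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_diff(
    correct_a: Sequence[int],  # 1 if system A answered query i correctly, else 0
    correct_b: Sequence[int],  # same queries, system B (e.g. random selection)
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed acc(A) - acc(B), lower bound, upper bound) of a ~95% interval."""
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("both systems must be scored on the same non-empty query set")
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    observed = (sum(correct_a) - sum(correct_b)) / n
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

An interval that contains zero would be consistent with the "no accuracy gain" outcome described above; an interval clearly above zero would count against it.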
Original abstract
In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because the space of possible demonstration contexts and combinations is enormous. We argue that demonstration selection is easier to judge than to find: predicting whether a specific query-context pair $(q, D)$ will succeed is cheaper and more general than searching for an optimal $D^\star$. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama 3-8B and Qwen 2.5-7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, while achieving up to 23× end-to-end wall-clock speedup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that demonstration selection for in-context learning is easier to judge than to find. It introduces DiSP, a sample-and-judge framework that estimates per-query success rates via random demonstration trials on the training split, trains a lightweight router to assign difficulty levels from query features, and trains level-specific judges. At inference, DiSP applies stop-on-acceptance judging under an explicit budget and emits risk tags when no suitable context is accepted. On five classification datasets using Llama 3-8B and Qwen 2.5-7B, DiSP reports the highest average accuracy (improving up to 3.4% over strong learned baselines) together with up to 23× end-to-end wall-clock speedup.
Significance. If the reported generalization of the difficulty router and level-specific judges holds, the work supplies a practical, budget-aware alternative to exhaustive or learned demonstration search. The concrete accuracy and speedup numbers, together with the diagnostic risk tags, would constitute a useful engineering contribution for reliable ICL deployment. The approach also supplies an explicit, falsifiable test of whether query difficulty is sufficiently stable to be predicted from surface features alone.
major comments (2)
- [§4 (Experiments) and §3.2 (Difficulty Router)] The headline accuracy and speedup claims rest on the transfer of success-rate labels obtained from random trials on training queries to unseen test queries. The manuscript should report the correlation between router-predicted difficulty and observed success rates on the test split, as well as an ablation that measures performance drop when the router is trained on a held-out portion of the training queries. Without these diagnostics, it remains unclear whether the stratification is capturing an intrinsic query property or merely fitting the training-query sampling distribution.
- [§3.3 (Inference Procedure) and Table 2] The stop-on-acceptance policy with a fixed acceptance budget is central to both the speedup and the risk-tag mechanism. The paper should include a sensitivity analysis showing how accuracy and wall-clock time vary with different budget values (e.g., 1, 5, 10) and whether the reported 23× speedup remains stable when the budget is chosen to match the computational cost of the strongest baseline.
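The first diagnostic requested above, the correlation between router-predicted difficulty and observed test-split success rates, could be computed with a rank correlation. The report does not say which coefficient is meant; the sketch below assumes Spearman's rho, implemented in plain Python, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Illustrative inputs only: predicted level per test query (higher = easier)
# and the success rate observed for that query under random demonstrations.
predicted_level = [0, 1, 2, 2]
observed_rate = [0.1, 0.4, 0.8, 0.9]
rho = spearman(predicted_level, observed_rate)
```

A rank-based coefficient is a natural fit here because difficulty levels are ordinal; a value near zero on the test split would suggest the router is not tracking a stable query property.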
minor comments (2)
- [Abstract and §4.1] The abstract states 'improving over strong learned selection baselines by up to 3.4%'; the main text should clarify whether this is absolute accuracy or relative improvement and report per-dataset deltas with standard deviations.
- [§3.2] Notation for the router input features and the judge scoring function is introduced without an explicit equation; adding a compact mathematical definition would improve reproducibility.
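The ambiguity raised in the first minor comment is easy to see with a small calculation. The baseline accuracy below is a made-up number used only to show how far the two readings of "3.4%" can diverge; it is not a result from the paper.

```python
def absolute_gain(baseline_acc: float, new_acc: float) -> float:
    """Improvement in percentage points."""
    return new_acc - baseline_acc


def relative_gain(baseline_acc: float, new_acc: float) -> float:
    """Improvement as a percentage of the baseline accuracy."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


baseline = 80.0  # hypothetical baseline accuracy, in percent

# Reading 1: "3.4%" means 3.4 percentage points.
acc_if_absolute = baseline + 3.4

# Reading 2: "3.4%" means 3.4% of the baseline accuracy.
acc_if_relative = baseline * 1.034

gap_between_readings = acc_if_absolute - acc_if_relative
```

With this hypothetical baseline the two readings differ by roughly two thirds of a point, which is why the report asks the authors to state which one is intended.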
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the requested analyses.
- Referee: [§4 (Experiments) and §3.2 (Difficulty Router)] The headline accuracy and speedup claims rest on the transfer of success-rate labels obtained from random trials on training queries to unseen test queries. The manuscript should report the correlation between router-predicted difficulty and observed success rates on the test split, as well as an ablation that measures performance drop when the router is trained on a held-out portion of the training queries. Without these diagnostics, it remains unclear whether the stratification is capturing an intrinsic query property or merely fitting the training-query sampling distribution.
  Authors: We agree that direct evidence of generalization from training-query success rates to test queries strengthens the central claim. In the revised manuscript we add (i) the correlation between router-predicted difficulty and empirical success rates measured on the held-out test split and (ii) an ablation in which the router is trained on a random 80% subset of the training queries and evaluated on the remaining 20%. Both results are reported in a new paragraph of §4.2 together with the corresponding figures; they indicate that the stratification captures stable query properties rather than merely memorizing the training sampling distribution.
  Revision: yes
- Referee: [§3.3 (Inference Procedure) and Table 2] The stop-on-acceptance policy with a fixed acceptance budget is central to both the speedup and the risk-tag mechanism. The paper should include a sensitivity analysis showing how accuracy and wall-clock time vary with different budget values (e.g., 1, 5, 10) and whether the reported 23× speedup remains stable when the budget is chosen to match the computational cost of the strongest baseline.
  Authors: We appreciate the request for a sensitivity study of the acceptance budget. The revised version includes a new sensitivity table (expanded Table 2) and an accompanying paragraph in §3.3 that reports accuracy and wall-clock time for budgets of 1, 5 and 10. We also evaluate the end-to-end speedup when the budget is set to match the average inference cost of the strongest learned baseline; the 23× figure remains stable under this cost-matched regime. The updated text explicitly discusses the trade-off between accuracy, latency and risk-tag frequency.
  Revision: yes
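The trade-off between budget, latency, and risk-tag frequency discussed in this exchange can be reasoned about with a simple model. The sketch assumes that each sampled context is accepted independently with a fixed probability `p_accept`, which is a simplification introduced here and not something the paper states; the function names are hypothetical.

```python
def expected_judge_calls(p_accept: float, budget: int) -> float:
    """E[min(G, budget)] for a geometric number of tries G with success probability p_accept."""
    if not 0.0 <= p_accept <= 1.0 or budget < 0:
        raise ValueError("need 0 <= p_accept <= 1 and budget >= 0")
    if p_accept == 0.0:
        return float(budget)  # never accepted, so the whole budget is spent
    return (1.0 - (1.0 - p_accept) ** budget) / p_accept


def risk_tag_rate(p_accept: float, budget: int) -> float:
    """Probability that all `budget` candidates are rejected and a risk tag is emitted."""
    return (1.0 - p_accept) ** budget


# Sweep the budgets named in the referee comment for an illustrative p_accept.
sweep = {b: (expected_judge_calls(0.5, b), risk_tag_rate(0.5, b)) for b in (1, 5, 10)}
```

Under this model the expected cost saturates at `1 / p_accept` as the budget grows while the risk-tag rate decays geometrically, so hard queries (small `p_accept`) are the ones for which the budget choice matters most.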
Circularity Check
No significant circularity; empirical labels from independent trials support standard supervised training
The paper estimates per-query success rates directly from random demonstration trials on the training split, then uses those observed rates to label difficulty levels for training a router and level-specific judges. This is a conventional supervised pipeline: expensive sampling produces independent labels, a lightweight model is fit to predict difficulty from query features alone, and inference applies the trained components under a stop-on-acceptance policy. No equation or step defines success via the router itself, renames a fitted parameter as a prediction, or reduces the central claim to a self-citation chain. The reported accuracy and speedup results are measured on held-out test queries, making the derivation self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of random demonstration trials per training query
- acceptance budget at inference
axioms (2)
- domain assumption: In-context learning performance is sensitive to demonstration choice and can be estimated from finite random trials.
- domain assumption: A lightweight model can learn to predict query difficulty and demonstration quality from sampled success data.
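The first free parameter and the first assumption interact: how well a success rate is estimated depends on how many trials are run. As a rough guide, and assuming the trials behave like independent Bernoulli draws (an assumption made here for illustration), the binomial standard error gives the scale of the noise. `success_rate_stderr` is a hypothetical helper.

```python
import math


def success_rate_stderr(rate: float, n_trials: int) -> float:
    """Binomial standard error of a success rate estimated from n_trials independent trials."""
    if n_trials <= 0 or not 0.0 <= rate <= 1.0:
        raise ValueError("need n_trials > 0 and 0 <= rate <= 1")
    return math.sqrt(rate * (1.0 - rate) / n_trials)


# Worst case is rate = 0.5; quadrupling the trial count halves the error.
stderr_16 = success_rate_stderr(0.5, 16)
stderr_64 = success_rate_stderr(0.5, 64)
```

If difficulty levels are cut at fixed thresholds, queries whose true success rate lies within about one standard error of a threshold are the ones most likely to receive a noisy label.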
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: "DiSP runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.