arxiv: 2605.18512 · v1 · pith:475STBQK · new · submitted 2026-05-18 · 💻 cs.CL

Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

Pith reviewed 2026-05-20 11:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords in-context learning · demonstration selection · query difficulty · sample-and-judge · LLM prompting · classification · stop-on-acceptance

The pith

Judging whether a query and demonstration set will succeed is cheaper than searching for the optimal set in in-context learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that demonstration selection for in-context learning is easier to judge than to find because predicting success for a specific query-context pair costs less than exhaustively searching combinations. DiSP implements this by running random trials on training queries to measure success rates, training a router to predict query difficulty, and training separate judges for each difficulty level. At test time the system samples contexts and judges them in order until one succeeds or the budget ends, which produces higher accuracy than learned baselines while cutting runtime substantially. A sympathetic reader would care because this reframes prompt engineering as a prediction task rather than an optimization task, making reliable in-context learning more practical for classification.

Core claim

DiSP is a sample-and-judge framework that stratifies queries by difficulty. It runs random demonstration trials to estimate the success rate of each training query, trains a lightweight router to predict difficulty from the query alone, and trains level-specific judges that score sampled demonstration contexts. At inference, DiSP performs stop-on-acceptance judging under an explicit budget and emits diagnostic risk tags when no suitable context is found.
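The inference loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_context`, `route`, `judges`, the acceptance `threshold`, the default `k` and `budget`, and the risk-tag string format are all hypothetical names and values chosen for the example. The summary does not say what context is used once the budget is exhausted, so the sketch simply returns `None` with a tag.

```python
import random
from typing import Callable, Optional, Sequence

Demo = tuple[str, str]  # (input text, label); an assumed representation


def select_context(
    query: str,
    pool: Sequence[Demo],
    route: Callable[[str], str],
    judges: dict[str, Callable[[str, Sequence[Demo]], float]],
    k: int = 4,
    budget: int = 8,
    threshold: float = 0.5,
    rng: Optional[random.Random] = None,
) -> tuple[Optional[list[Demo]], Optional[str]]:
    """Return (context, risk_tag); risk_tag is None when a context is accepted."""
    if rng is None:
        rng = random.Random(0)
    level = route(query)   # router predicts a difficulty level from the query
    judge = judges[level]  # pick the judge trained for that level
    for _ in range(budget):
        # Sample an ordered k-shot context without repetition.
        context = rng.sample(list(pool), k)
        if judge(query, context) >= threshold:
            return context, None  # stop on the first accepted context
    # Budget exhausted: emit a diagnostic tag instead of a context.
    return None, f"no_accepted_context:{level}"
```

Because the loop exits on the first acceptance, the number of judge calls varies per query and is bounded above by `budget`.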

What carries the argument

The sample-and-judge framework that uses success-rate estimates from random trials to train a difficulty router and level-specific judges, then applies stop-on-acceptance selection at inference.

If this is right

  • DiSP achieves the best average accuracy across five classification datasets with Llama 3-8B and Qwen 2.5-7B.
  • It improves over strong learned selection baselines by up to 3.4%.
  • It achieves up to 23× end-to-end wall-clock speedup.
  • It emits diagnostic risk tags when no suitable context is found within budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trial-and-judge pattern could be applied to non-classification tasks such as reasoning or generation by redefining success.
  • Variable per-query latency from stop-on-acceptance could be mitigated by caching common difficulty levels in production systems.
  • If random trials under-sample rare query types, the router might systematically misclassify those cases and reduce overall reliability.

Load-bearing premise

Success rates measured from random demonstration trials on training queries are sufficiently predictive to train a generalizable difficulty router and level-specific judges that transfer to unseen test queries under the stop-on-acceptance policy.
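To make the premise concrete, here is a small sketch of how per-query success rates might be estimated from random trials and then binned into difficulty levels. The function names, the `run_icl` callable (a stand-in for prompting the target LLM), the trial count, the three-level scheme, and the cutoffs are all assumptions for illustration; the summary above does not state the number of levels or the thresholds the paper uses.

```python
import random
from typing import Callable, Sequence


def estimate_success_rate(
    query: str,
    gold: str,
    pool: Sequence[tuple[str, str]],
    run_icl: Callable[[str, Sequence[tuple[str, str]]], str],
    k: int = 4,
    trials: int = 16,
    seed: int = 0,
) -> float:
    """Fraction of random k-shot contexts under which the model answers correctly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        context = rng.sample(list(pool), k)  # one random demonstration context
        hits += int(run_icl(query, context) == gold)
    return hits / trials


def difficulty_level(rate: float, cutoffs: tuple[float, float] = (0.33, 0.66)) -> str:
    """Map an empirical success rate to a coarse level (hypothetical cutoffs)."""
    if rate < cutoffs[0]:
        return "hard"
    if rate < cutoffs[1]:
        return "medium"
    return "easy"
```

The resulting `(query, level)` pairs would serve as router training labels, and the per-trial `(query, context, success)` records as judge training data.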

What would settle it

If the trained router and judges produce no accuracy gain or speedup on held-out test queries compared with random selection, the claim that trial-based difficulty prediction generalizes would be falsified.

Figures

Figures reproduced from arXiv: 2605.18512 by Bing Qin, Chaofen Yang, Haochun Wang, Jiatong Liu, Jingbo Wang, Sendong Zhao, Ting Liu, Zewen Qiang.

Figure 1: Finding vs. judging for demonstration selection. Searching for an optimal $D^\star$ faces a combinatorial space, while judging enables efficient sample-and-test with stop-on-acceptance under an explicit budget.
Adjacent body text (truncated in extraction): … combinatorial. Given a candidate pool of size $N$, selecting $k$ demonstrations and ordering them yields $\binom{N}{k}\,k!$ possible demonstrations. Brute-force evaluation is typically infeasible because each candida…
Figure 2: Overview of DiSP. Stage 1: run the target LLM on each training query under multiple random k-shot contexts to label success and estimate an empirical success rate for difficulty stratification. Stage 2: train a router and level-specific judges to predict success for a given (q, D) pair. Stage 3: at test time, route each query and apply stop-on-acceptance sample-and-judge over sampled contexts up to a budge…
Figure 3: Hidden-state probes provide evidence that success and failure form separable clusters in the representation space (LLaMA3-8B on TREC). We report AUROC/AUPRC in …
Figure 4: Last-layer MLP probe ROC/PR curves for success prediction on TREC.
[Extracted panel text, titled SST2_LLAMA3_8B (Accuracy: 0.9348), panel (a) LLaMA3-8B. AUROC curve (False Positive Rate vs. True Positive Rate): correct 0.7996, incorrect 0.8516, macro 0.8264, micro 0.9763. AUPRC curve (Recall vs. Precision): correct 0.9766, incorrect 0.3204, micro 0.9686.]
Figure 5: Last-layer MLP probe ROC/PR curves for success prediction on SST-2.
[Extracted panel text, titled SST5_LLAMA3_8B (Accuracy: 0.5954), panel (a) LLaMA3-8B. AUROC curve (False Positive Rate vs. True Positive Rate): correct 0.6249, incorrect 0.6250, macro 0.6253, micro 0.6238. AUPRC curve (Recall vs. Precision): correct 0.5855, incorrect 0.6381, micro 0.6088.]
Figure 6: Last-layer MLP probe ROC/PR curves for success prediction on SST-5.
[Extracted panel text, titled AGNEWS_LLAMA3_8B (Accuracy: 0.8901), panel (a) LLaMA3-8B. AUROC curve (False Positive Rate vs. True Positive Rate): correct 0.8634, incorrect 0.8803, macro 0.8723, micro 0.9443. AUPRC curve (Recall vs. Precision): correct 0.9511, incorrect 0.6913, micro 0.9311.]
Figure 7: Last-layer MLP probe ROC/PR curves for success prediction on AGNEWS.
[Extracted panel text, titled MNLI_LLAMA3_8B (Accuracy: 0.7848), panel (a) LLaMA3-8B. AUROC curve (False Positive Rate vs. True Positive Rate): correct 0.8398, incorrect 0.8500, macro 0.8452, micro 0.8669. AUPRC curve (Recall vs. Precision): correct 0.8907, incorrect 0.7440, micro 0.8538.]
Figure 8: Last-layer MLP probe ROC/PR curves for success prediction on MNLI.
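The count quoted alongside Figure 1 (choose k of N candidates, then order them) grows quickly even for small k. A short worked example, with `num_ordered_contexts` as a hypothetical helper name and N = 100, k = 4 picked purely for illustration:

```python
import math


def num_ordered_contexts(n: int, k: int) -> int:
    """Ordered k-shot contexts drawn without repetition from n candidates."""
    # C(n, k) ways to choose the demonstrations, k! ways to order them.
    return math.comb(n, k) * math.factorial(k)


# Equivalent to the number of k-permutations of n items.
assert num_ordered_contexts(100, 4) == math.perm(100, 4)
print(num_ordered_contexts(100, 4))  # 94109400
```

Each of those contexts would need at least one LLM call to evaluate exhaustively, which is the cost the sample-and-judge approach aims to avoid.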
Original abstract

In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because the space of possible demonstration contexts and combinations is enormous. We argue that demonstration selection is easier to judge than to find: predicting whether a specific query-context pair $(q,D)$ will succeed is cheaper and more general than searching for an optimal $D^\star$. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama 3-8B and Qwen 2.5-7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, while achieving up to 23× end-to-end wall-clock speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that demonstration selection for in-context learning is easier to judge than to find. It introduces DiSP, a sample-and-judge framework that estimates per-query success rates via random demonstration trials on the training split, trains a lightweight router to assign difficulty levels from query features, and trains level-specific judges. At inference, DiSP applies stop-on-acceptance judging under an explicit budget and emits risk tags when no suitable context is accepted. On five classification datasets using Llama 3-8B and Qwen 2.5-7B, DiSP reports the highest average accuracy (improving up to 3.4% over strong learned baselines) together with up to 23× end-to-end wall-clock speedup.

Significance. If the reported generalization of the difficulty router and level-specific judges holds, the work supplies a practical, budget-aware alternative to exhaustive or learned demonstration search. The concrete accuracy and speedup numbers, together with the diagnostic risk tags, would constitute a useful engineering contribution for reliable ICL deployment. The approach also supplies an explicit, falsifiable test of whether query difficulty is sufficiently stable to be predicted from surface features alone.

major comments (2)
  1. [§4 (Experiments) and §3.2 (Difficulty Router)] The headline accuracy and speedup claims rest on the transfer of success-rate labels obtained from random trials on training queries to unseen test queries. The manuscript should report the correlation between router-predicted difficulty and observed success rates on the test split, as well as an ablation that measures performance drop when the router is trained on a held-out portion of the training queries. Without these diagnostics, it remains unclear whether the stratification is capturing an intrinsic query property or merely fitting the training-query sampling distribution.
  2. [§3.3 (Inference Procedure) and Table 2] The stop-on-acceptance policy with a fixed acceptance budget is central to both the speedup and the risk-tag mechanism. The paper should include a sensitivity analysis showing how accuracy and wall-clock time vary with different budget values (e.g., 1, 5, 10) and whether the reported 23× speedup remains stable when the budget is chosen to match the computational cost of the strongest baseline.
minor comments (2)
  1. [Abstract and §4.1] The abstract states 'improving over strong learned selection baselines by up to 3.4%'; the main text should clarify whether this is absolute accuracy or relative improvement and report per-dataset deltas with standard deviations.
  2. [§3.2] Notation for the router input features and the judge scoring function is introduced without an explicit equation; adding a compact mathematical definition would improve reproducibility.
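The diagnostic requested in major comment 1 could be computed with a rank correlation between router-predicted difficulty and observed test-split success rates. The sketch below is one way to do that with the standard library only; the function names and the integer encoding of difficulty levels (0 = easy, 1 = medium, 2 = hard) are assumptions, and the choice of Spearman correlation is ours rather than something the report specifies.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)  # undefined (division by zero) for constant input


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A strongly negative value of `spearman(predicted_levels, observed_success_rates)` on held-out queries would indicate that queries routed as harder do succeed less often, which is what the stratification is supposed to capture.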

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the requested analyses.

point-by-point responses
  1. Referee: [§4 (Experiments) and §3.2 (Difficulty Router)] The headline accuracy and speedup claims rest on the transfer of success-rate labels obtained from random trials on training queries to unseen test queries. The manuscript should report the correlation between router-predicted difficulty and observed success rates on the test split, as well as an ablation that measures performance drop when the router is trained on a held-out portion of the training queries. Without these diagnostics, it remains unclear whether the stratification is capturing an intrinsic query property or merely fitting the training-query sampling distribution.

    Authors: We agree that direct evidence of generalization from training-query success rates to test queries strengthens the central claim. In the revised manuscript we add (i) the correlation between router-predicted difficulty and empirical success rates measured on the held-out test split and (ii) an ablation in which the router is trained on a random 80% subset of the training queries and evaluated on the remaining 20%. Both results are reported in a new paragraph of §4.2 together with the corresponding figures; they indicate that the stratification captures stable query properties rather than merely memorizing the training sampling distribution.
    revision: yes

  2. Referee: [§3.3 (Inference Procedure) and Table 2] The stop-on-acceptance policy with a fixed acceptance budget is central to both the speedup and the risk-tag mechanism. The paper should include a sensitivity analysis showing how accuracy and wall-clock time vary with different budget values (e.g., 1, 5, 10) and whether the reported 23× speedup remains stable when the budget is chosen to match the computational cost of the strongest baseline.

    Authors: We appreciate the request for a sensitivity study of the acceptance budget. The revised version includes a new sensitivity table (expanded Table 2) and an accompanying paragraph in §3.3 that reports accuracy and wall-clock time for budgets of 1, 5 and 10. We also evaluate the end-to-end speedup when the budget is set to match the average inference cost of the strongest learned baseline; the 23× figure remains stable under this cost-matched regime. The updated text explicitly discusses the trade-off between accuracy, latency and risk-tag frequency.
    revision: yes
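The budget trade-off discussed in this exchange can be illustrated with a simple idealized model. Assume, purely for illustration, that each sampled context is accepted independently with the same probability `p`; the paper summary makes no such claim, and the function names below are hypothetical. Under that assumption the chance of accepting at least one context within budget B is 1 - (1 - p)^B, and the expected number of judge calls is the sum of (1 - p)^i for i from 0 to B - 1.

```python
def acceptance_probability(p: float, budget: int) -> float:
    """P(at least one of `budget` independent samples is accepted)."""
    return 1.0 - (1.0 - p) ** budget


def expected_judge_calls(p: float, budget: int) -> float:
    """Expected judge calls when stopping at the first acceptance or at the budget."""
    # Call i+1 happens only if the first i samples were all rejected.
    return sum((1.0 - p) ** i for i in range(budget))


for b in (1, 5, 10):  # the budget values named in the referee comment
    print(b, acceptance_probability(0.3, b), expected_judge_calls(0.3, b))
```

In this toy model the risk-tag frequency is `1 - acceptance_probability(p, budget)`, so raising the budget lowers the tag rate while the expected cost grows sublinearly.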

Circularity Check

0 steps flagged

No significant circularity; empirical labels from independent trials support standard supervised training

full rationale

The paper estimates per-query success rates directly from random demonstration trials on the training split, then uses those observed rates to label difficulty levels for training a router and level-specific judges. This is a conventional supervised pipeline: expensive sampling produces independent labels, a lightweight model is fit to predict difficulty from query features alone, and inference applies the trained components under a stop-on-acceptance policy. No equation or step defines success via the router itself, renames a fitted parameter as a prediction, or reduces the central claim to a self-citation chain. The reported accuracy and speedup results are measured on held-out test queries, making the derivation self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the empirical observability of ICL success via sampling and the assumption that lightweight predictors trained on those samples generalize; no new physical entities are introduced.

free parameters (2)
  • number of random demonstration trials per training query
    Used to estimate per-query success rates for router and judge training.
  • acceptance budget at inference
    Explicit limit on number of judged contexts before emitting a risk tag.
axioms (2)
  • domain assumption: In-context learning performance is sensitive to demonstration choice and can be estimated from finite random trials.
    Foundational premise stated in the problem setup.
  • domain assumption: A lightweight model can learn to predict query difficulty and demonstration quality from sampled success data.
    Core modeling assumption enabling the router and judges.
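The first free parameter, the number of random trials per training query, controls how noisy each success-rate label is. As a rough guide, if the trials for one query are treated as independent Bernoulli draws (an assumption made here for illustration, as is the helper name), the standard error of the estimated rate is sqrt(p(1 - p) / T) for T trials:

```python
import math


def success_rate_standard_error(p_hat: float, trials: int) -> float:
    """Binomial standard error of a success rate estimated from `trials` draws."""
    return math.sqrt(p_hat * (1.0 - p_hat) / trials)


# Quadrupling the number of trials halves the standard error.
print(success_rate_standard_error(0.5, 16))  # 0.125
print(success_rate_standard_error(0.5, 64))  # 0.0625
```

Queries whose true success rate sits near a level boundary are the ones most likely to be mislabeled when the trial count is small.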

pith-pipeline@v0.9.0 · 5752 in / 1449 out tokens · 67413 ms · 2026-05-20T11:07:33.408539+00:00 · methodology
