UCS: Estimating Unseen Coverage for Improved In-Context Learning

Evan Qiang; Jiayi Xin; Qi Long; Tianqi Shang; Weijie J. Su; Weiqing He; Xiang Li

arxiv: 2604.12015 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.CL

Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang, Weijie J. Su, Qi Long

Pith reviewed 2026-05-10 15:35 UTC · model grok-4.3

keywords: in-context learning · demonstration selection · unseen coverage · Good-Turing estimator · latent clusters · large language models · prompt engineering

The pith

Estimating unseen task clusters in demonstration sets improves in-context learning accuracy by up to 2-6%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Unseen Coverage Selection as a training-free way to pick demonstrations for in-context learning. It first turns model embeddings into discrete clusters that capture different parts of the task, then applies a statistical estimator to count how many of those clusters are missing from any given subset of examples. The resulting coverage score is added to existing selection rules through a regularized objective, producing higher accuracy on classification and reasoning tasks. A sympathetic reader would care because the approach directly addresses the limited space in prompts by focusing on what the model has not yet encountered rather than just relevance or diversity of the chosen examples.

Core claim

UCS operationalizes the principle that a good demonstration set should expose the model to latent clusters unrevealed by the currently selected subset. It does so by inducing discrete latent clusters from model-consistent embeddings and estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good-Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage-based and training-free, and can be seamlessly combined with both query-dependent and query-independent selection baselines via a simple regularized objective. Experiments on multiple intent-classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget.

What carries the argument

The Smoothed Good-Turing estimator applied to the empirical frequency spectrum of discrete latent clusters induced from model-consistent embeddings, which quantifies the unseen coverage of any candidate demonstration subset.
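
To make the estimator concrete, here is a minimal Python sketch. It is not the authors' implementation: it computes the frequency spectrum of a subset's cluster labels and applies one textbook form of the smoothed Good-Turing (Good-Toulmin) unseen-species series with Poisson smoothing. The function names, the extrapolation ratio `t`, and the smoothing rate `r` are assumptions chosen for illustration; the paper's smoothing distribution and parameters may differ.

```python
import math
from collections import Counter


def frequency_spectrum(labels):
    """Map k -> number of clusters that appear exactly k times in `labels`."""
    cluster_counts = Counter(labels)          # cluster id -> occurrences
    return Counter(cluster_counts.values())   # occurrences -> number of clusters


def poisson_tail(i, r):
    """P(L >= i) for L ~ Poisson(r)."""
    cdf = sum(math.exp(-r) * r**j / math.factorial(j) for j in range(i))
    return max(0.0, 1.0 - cdf)


def sgt_unseen(spectrum, t, r=None):
    """Estimate how many new clusters t*n further draws would reveal.

    Computes U = -sum_i (-t)^i * P(L >= i) * Phi_i. With r=None every tail
    weight is 1, which is the unsmoothed Good-Toulmin series (hypothetical
    default, chosen here only to keep the toy example easy to check by hand).
    """
    total = 0.0
    for i, phi_i in spectrum.items():
        weight = 1.0 if r is None else poisson_tail(i, r)
        total -= (-t) ** i * weight * phi_i
    return total


# Made-up cluster labels for an 8-shot subset.
labels = ["a", "a", "b", "c", "d", "d", "d", "e"]
spectrum = frequency_spectrum(labels)      # three singletons, one pair, one triple
estimate = sgt_unseen(spectrum, t=1.0)     # unsmoothed: 3 - 1 + 1 = 3
smoothed = sgt_unseen(spectrum, t=2.0, r=1.0)
```

The unsmoothed series becomes unstable once `t` exceeds 1, which is why the tail weights damp the higher-order terms; how strongly to damp them is a modeling choice rather than something this sketch settles.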

If this is right

Augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget.
Yields insights into task- and model-level latent cluster distributions.
Can be combined with both query-dependent and query-independent selection baselines via a regularized objective.
Applies across multiple intent-classification and reasoning benchmarks with frontier large language models.
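
The regularized combination with a base selector can be sketched as follows. This is an illustration under stated assumptions, not the paper's objective: the additive form `base + lam * coverage`, the greedy search, and the use of a distinct-cluster count as a stand-in coverage term are all hypothetical choices made to keep the example small. Any callable, including a Good-Turing based estimate, could be passed as `coverage_fn`.

```python
def greedy_select(candidates, budget, base_score, coverage_fn, lam=0.5):
    """Greedily build a subset that maximizes base utility + lam * coverage."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        def objective(candidate):
            subset = selected + [candidate]
            return sum(base_score(x) for x in subset) + lam * coverage_fn(subset)
        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy pool of (example id, cluster id, relevance score). All values are made up.
pool = [("q1", "A", 0.9), ("q2", "A", 0.8), ("q3", "B", 0.6), ("q4", "C", 0.5)]


def relevance(example):
    return example[2]


def distinct_clusters(subset):
    # Stand-in coverage term: number of different clusters in the subset.
    return len({example[1] for example in subset})


by_relevance = greedy_select(pool, 3, relevance, distinct_clusters, lam=0.0)
with_coverage = greedy_select(pool, 3, relevance, distinct_clusters, lam=0.5)
```

With `lam=0.0` the selector keeps the three most relevant examples, two of which share cluster A; with `lam=0.5` it trades the second cluster-A example for one from cluster C under the same budget of three.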

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The coverage principle might extend to selecting examples for generative tasks where prompt space is limited.
Cluster distribution insights could help diagnose model-specific strengths on particular task types.
Testing the estimator with alternative embedding methods would show how sensitive the gains are to the cluster induction step.

Load-bearing premise

That discrete latent clusters induced from model-consistent embeddings meaningfully represent unrevealed task structure and that the Smoothed Good-Turing estimator on the empirical frequency spectrum accurately estimates the number of unrevealed clusters.

What would settle it

If adding UCS to strong baselines produces no consistent accuracy gains on standard ICL benchmarks under fixed selection budgets, the benefit of estimating unseen cluster coverage would not hold.

Figures

Figures reproduced from arXiv: 2604.12015 by Evan Qiang, Jiayi Xin, Qi Long, Tianqi Shang, Weijie J. Su, Weiqing He, Xiang Li.

**Figure 1.** In contrast to prior methods (left), our approach reasons at the subset level by clustering demonstrations into latent clusters and using their frequency to measure coverage (right). In this work, we argue that coverage provides a complementary dimension for reasoning about demonstration selection.

**Figure 2.** Schematics of UCS. (1) Demonstrations are embedded in a model-consistent space. (2) Embeddings are discretized into latent clusters via dictionary learning and clustering. (3) Candidate subsets are scored using an SGT estimation-regularized objective with a base selection utility. (4) The selected subset is used for ICL. procedure: (i) learning a latent dictionary over embeddings, (ii) encoding each examp…

**Figure 3.** Cluster-size distributions across datasets and LLMs. Left: number of demonstrations in clusters of size k ∈ [1, 8] (heavy-tailed, many singletons). Right: sizes of the top-8 clusters (a few dominant clusters).
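
The two quantities plotted in Figure 3 can be computed from a list of cluster assignments. The sketch below is a hypothetical helper, not code from the paper's repository; the function name, its defaults, and the toy labels are assumptions.

```python
from collections import Counter


def cluster_size_summary(labels, max_size=8, top=8):
    """Return (demonstrations per cluster size k <= max_size, top cluster sizes)."""
    sizes = Counter(labels)                      # cluster id -> cluster size
    demos_by_size = Counter()
    for size in sizes.values():
        if size <= max_size:
            demos_by_size[size] += size          # a size-k cluster holds k demonstrations
    top_sizes = sorted(sizes.values(), reverse=True)[:top]
    return dict(sorted(demos_by_size.items())), top_sizes


# Made-up cluster assignments for ten demonstrations.
labels = [0, 0, 0, 0, 0, 1, 1, 2, 3, 4]
demos_by_size, top_sizes = cluster_size_summary(labels)
```

In this toy pool three demonstrations sit in singleton clusters while one cluster holds half the pool, a small-scale version of the heavy-tailed shape the caption describes.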

Original abstract

In-context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UCS), a training-free, subset-level coverage prior motivated by the principle that a good demonstration set should expose the model to latent clusters unrevealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model-consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good-Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage-based and training-free, and can be seamlessly combined with both query-dependent and query-independent selection baselines via a simple regularized objective. Experiments on multiple intent-classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget, while also yielding insights into task- and model-level latent cluster distributions. Code is available at https://github.com/Raina-Xin/UCS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UCS adds a coverage prior to ICL selection via embedding clusters and Smoothed Good-Turing, with small consistent gains, but the estimator's reliability on tiny finite pools is the open question.

The paper's main move is to treat demonstration selection as a coverage problem: induce clusters from the model's own embeddings of the candidate pool, then use Smoothed Good-Turing on the subset's cluster frequency counts to estimate how many clusters remain unseen. They fold this estimate into a regularized objective that sits on top of existing relevance or diversity selectors. The result is a training-free method that lifts accuracy 2-6% on intent classification and reasoning benchmarks while staying within the same shot budget. Code is out, which helps.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Unseen Coverage Selection (UCS), a training-free method for in-context learning demonstration selection. It induces discrete latent clusters from model-consistent embeddings of the candidate pool and applies a Smoothed Good-Turing estimator to the empirical frequency spectrum of a selected subset to estimate the number of unrevealed clusters. This coverage estimate is incorporated as a regularized term in the selection objective, which can be combined with existing query-dependent or query-independent baselines. Experiments on intent classification and reasoning benchmarks with frontier LLMs report consistent accuracy gains of up to 2-6% under fixed selection budgets, along with analyses of task- and model-level cluster distributions.

Significance. If the reported gains hold under rigorous controls, the work contributes a coverage-based prior that complements relevance and diversity heuristics in ICL without requiring training. The training-free design and use of an established statistical estimator (Smoothed Good-Turing) are strengths, as is the public code release. The approach also yields interpretable insights into latent structures, which could inform future work on demonstration selection and model behavior.

major comments (3)

[§3.2] The central claim that UCS improves ICL accuracy by estimating unrevealed clusters rests on the Smoothed Good-Turing estimator accurately mapping the empirical frequency spectrum of small subsets (typical ICL budgets of 8-32 examples) to the number of unseen clusters in a finite candidate pool. However, Good-Turing was derived for large open populations with power-law frequencies; the manuscript provides no theoretical adaptation or sensitivity analysis showing that the estimator remains reliable when subset sizes are small and cluster assignments depend on embedding quality plus hyperparameters K and linkage.
[§4.3 and Table 2] The reported 2-6% gains when augmenting baselines with UCS are load-bearing for the contribution, yet the experiments lack ablations that isolate the coverage term (e.g., replacing the Good-Turing estimate with a simple diversity regularizer or random noise) or vary the number of latent clusters. Without these, it remains possible that observed improvements arise from incidental regularization rather than meaningful unseen-coverage guidance.
[§3.1] The assumption that discrete clusters induced from model embeddings represent latent task structure that is 'unrevealed' by the current subset is not validated against ground-truth task labels or human annotations. If cluster assignments primarily reflect surface-level embedding artifacts rather than semantic task structure, the coverage prior may not target the intended quantity.
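
The small-subset concern in the first comment can be probed with a simulation. The sketch below is one hypothetical way to set it up, not an analysis from the manuscript: the pool size, the Zipf-like cluster weights, the Poisson smoothing, and the hand-picked rate `r = 1/t` are all assumptions. It draws subsets of 8, 16, and 32 items without replacement from a finite pool and compares the estimated number of missed clusters against the true count.

```python
import math
import random
from collections import Counter


def sgt_unseen(labels, t, r):
    """Poisson-smoothed Good-Toulmin estimate of clusters revealed by t*n more draws."""
    spectrum = Counter(Counter(labels).values())
    total = 0.0
    for i, phi_i in spectrum.items():
        tail = 1.0 - sum(math.exp(-r) * r**j / math.factorial(j) for j in range(i))
        total -= (-t) ** i * tail * phi_i
    return total


def simulate(pool_size=200, n_clusters=40, subset_sizes=(8, 16, 32), trials=200, seed=0):
    rng = random.Random(seed)
    # Zipf-like weights give a few large clusters and many small ones.
    weights = [1.0 / (k + 1) for k in range(n_clusters)]
    pool = rng.choices(range(n_clusters), weights=weights, k=pool_size)
    rows = []
    for n in subset_sizes:
        t = (pool_size - n) / n            # remaining pool relative to the subset
        errors = []
        for _ in range(trials):
            subset = rng.sample(pool, n)   # finite pool, drawn without replacement
            truth = len(set(pool)) - len(set(subset))   # clusters the subset misses
            # r = 1/t is a hand-picked smoothing rate, not a value from the paper.
            errors.append(sgt_unseen(subset, t, r=1.0 / t) - truth)
        mean_error = sum(errors) / trials
        rmse = math.sqrt(sum(e * e for e in errors) / trials)
        rows.append((n, mean_error, rmse))
    return rows


for n, bias, rmse in simulate():
    print(f"subset={n:>2}  bias={bias:+.2f}  rmse={rmse:.2f}")
```

Sweeping `n_clusters`, the weight profile, and `r` alongside the subset size would give the kind of sensitivity table the comment asks for; the numbers this toy setup prints say nothing about the paper's actual benchmarks.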

minor comments (2)

[Eq. 3] The notation for the regularized objective (Eq. 3) could be clarified by explicitly stating how the coverage estimate is normalized and scaled relative to the baseline score.
[Figure 3] Figure 3 (cluster distribution visualizations) would benefit from error bars or multiple random seeds to show stability of the reported task- and model-level insights.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We appreciate the recognition of UCS as a training-free coverage prior that complements existing ICL selection methods. We address each major comment below and outline the revisions we will make.

Referee: [§3.2] The central claim that UCS improves ICL accuracy by estimating unrevealed clusters rests on the Smoothed Good-Turing estimator accurately mapping the empirical frequency spectrum of small subsets (typical ICL budgets of 8-32 examples) to the number of unseen clusters in a finite candidate pool. However, Good-Turing was derived for large open populations with power-law frequencies; the manuscript provides no theoretical adaptation or sensitivity analysis showing that the estimator remains reliable when subset sizes are small and cluster assignments depend on embedding quality plus hyperparameters K and linkage.

Authors: We acknowledge that the original Good-Turing estimator was developed for large populations, and the manuscript does not provide a new theoretical derivation for the small-subset ICL regime. However, the smoothed variant has been applied successfully in other finite-sample settings in NLP and statistics. Our empirical results across multiple benchmarks and models indicate that the estimator provides useful coverage signals under typical ICL budgets. To strengthen the paper, we will add a sensitivity analysis in the appendix that varies subset size (8–32), the number of clusters K, and linkage methods, reporting how the unseen-coverage estimate and final accuracy change. This will empirically demonstrate reliability in the relevant operating range. revision: yes

Referee: [§4.3 and Table 2] The reported 2-6% gains when augmenting baselines with UCS are load-bearing for the contribution, yet the experiments lack ablations that isolate the coverage term (e.g., replacing the Good-Turing estimate with a simple diversity regularizer or random noise) or vary the number of latent clusters. Without these, it remains possible that observed improvements arise from incidental regularization rather than meaningful unseen-coverage guidance.

Authors: We agree that isolating the contribution of the unseen-coverage term is necessary to rule out incidental regularization effects. In the revised manuscript we will add three targeted ablations: (1) replacing the Smoothed Good-Turing estimate with a simple pairwise-distance diversity regularizer, (2) substituting random noise for the coverage term while keeping the same regularization weight, and (3) sweeping the number of latent clusters K. These experiments will be reported alongside the existing results in §4.3 and Table 2 to confirm that the observed gains are attributable to the coverage estimate rather than generic regularization. revision: yes

Referee: [§3.1] The assumption that discrete clusters induced from model embeddings represent latent task structure that is 'unrevealed' by the current subset is not validated against ground-truth task labels or human annotations. If cluster assignments primarily reflect surface-level embedding artifacts rather than semantic task structure, the coverage prior may not target the intended quantity.

Authors: This is a fair observation. The clusters are derived from model-consistent embeddings that are intended to capture semantic similarity, and our analyses of cluster distributions across tasks and models provide indirect support that they align with meaningful data variations. However, we do not provide direct validation against ground-truth labels or human judgments. We will add a qualitative appendix with representative examples of cluster assignments and their semantic interpretations for the intent-classification tasks. A comprehensive human annotation study would require substantial additional effort and may be left for future work. revision: partial
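
The first two ablation arms named in the responses, a pairwise-distance diversity regularizer and a random-noise control, could be set up roughly as below. This is a sketch under assumptions: the function names, the use of mean Euclidean distance, and the additive scoring form are hypothetical and are not taken from the manuscript.

```python
import itertools
import math
import random


def pairwise_diversity(vectors):
    """Mean Euclidean distance over all pairs; 0.0 for fewer than two vectors."""
    pairs = list(itertools.combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(u, v) for u, v in pairs) / len(pairs)


def noise_regularizer(rng):
    """Build a term that ignores the subset, for the random-noise control arm."""
    return lambda vectors: rng.random()


def score(vectors, base, regularizer, lam):
    """Same additive form for every arm, so only the regularizer changes."""
    return base + lam * regularizer(vectors)


# Made-up 2-D embeddings for a two-shot subset.
subset = [(0.0, 0.0), (3.0, 4.0)]
diversity_arm = score(subset, base=1.0, regularizer=pairwise_diversity, lam=0.1)
noise_arm = score(subset, base=1.0, regularizer=noise_regularizer(random.Random(0)), lam=0.1)
```

Holding `base` and `lam` fixed across arms is the point of the design: if the noise arm matches the coverage arm's accuracy, the gain is more plausibly generic regularization than coverage guidance.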

Circularity Check

0 steps flagged

No circularity in UCS derivation or claims

The paper defines UCS by inducing clusters from model embeddings and applying the external Smoothed Good-Turing estimator to the empirical frequency spectrum of a subset to estimate unrevealed clusters, then incorporates this coverage term into a regularized selection objective with existing baselines. All performance claims (2-6% gains) are presented as empirical results from benchmark experiments rather than algebraic identities or fitted parameters that encode the target outcome by construction. No equations or steps reduce the central method or results to self-definition, renamed inputs, or load-bearing self-citations; the estimator is a classic statistical tool independent of the paper's data or claims. The derivation chain is therefore self-contained with external empirical validation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that induced clusters capture relevant unrevealed structure and that the Good-Turing estimator provides a valid unseen count; potential free parameters include regularization strength and cluster count.

free parameters (2)

regularization strength
Used in the regularized objective combining UCS with baselines; value not specified in abstract but required for the method.
number of latent clusters
Chosen when inducing discrete clusters from embeddings; affects the frequency spectrum and unseen estimate.

axioms (2)

domain assumption Model-consistent embeddings reflect latent task-relevant structure
Invoked when inducing discrete latent clusters from embeddings to operationalize coverage.
domain assumption Smoothed Good-Turing estimator accurately estimates unrevealed clusters from empirical frequency spectrum
Core to step (2) of UCS for estimating the number of unrevealed clusters.
