pith. sign in

arxiv: 2605.03644 · v2 · pith:A4NJWH45new · submitted 2026-05-05 · 💻 cs.AI

AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse

Pith reviewed 2026-05-07 16:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords many-shot in-context learningadaptive ICLKV cache reuseoutput entropyLLM inference optimizationdynamic shot selectionsemantic-aware caching
0
0 comments X

The pith

AdapShot selects the optimal number of in-context examples for each query by measuring output entropy in a probe run and reuses KV cache with reordering to enable efficient many-shot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many-shot in-context learning works better with more examples but a fixed shot count either under-supplies hard queries or adds noise to easy ones while long contexts drive up compute costs. AdapShot runs a short probe inference on each query and uses the resulting output entropy to choose the shot count that minimizes uncertainty for that specific input. To avoid repeating the expensive prefilling step for every probe and for the final run, it reuses prior key-value pairs by first decoupling them from their original positions and then re-encoding them to match the new sequence order. This combination lets the system adapt shot counts per query without the full cost of many-shot contexts.

Core claim

AdapShot dynamically optimizes shot counts using a probe-based evaluation with output entropy and employs a semantics-aware KV cache reuse strategy with decoupling and re-encoding to bypass redundant prefilling, achieving better performance and efficiency in many-shot in-context learning.

What carries the argument

Probe-based entropy evaluation for choosing shot count, together with decoupling and re-encoding to reorder cached key-value pairs for semantic-aware reuse.

If this is right

  • Queries receive different numbers of shots according to their measured entropy rather than a single fixed value.
  • Prefilling work is skipped for both probes and final inference by reusing and reordering existing KV pairs.
  • Positional encoding incompatibilities are handled so that reordered cache entries remain compatible with the model.
  • Overall inference becomes faster while average accuracy rises compared with static or prior adaptive baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The entropy probe could serve as a lightweight signal for adapting other ICL choices such as example ordering or format.
  • Similar decoupling techniques might reduce recomputation in other long-context tasks like retrieval-augmented generation.
  • If the reordering preserves accuracy across model families, the method points toward query-adaptive prompting that requires no per-model tuning.

Load-bearing premise

Output entropy from a short probe run is a sufficient and unbiased signal for selecting the globally optimal shot count, and the decoupling-plus-re-encoding step for KV cache reordering introduces no accuracy degradation.

What would settle it

Measure whether accuracy using the entropy-selected shot count matches or exceeds the accuracy obtained by exhaustively testing several shot counts on the same query, or compare final accuracy when the KV cache is reordered versus when the entire context is recomputed from scratch.

Figures

Figures reproduced from arXiv: 2605.03644 by Jie Ou, Jinyu Guo, Ruiqi Wu, Shiyao Guo, Wenhong Tian, Wenyi Li, Yuang Li, Zhaokun Wang.

Figure 1
Figure 1. Figure 1: (a) Comparison of Few-Shot, adaptive Many view at source ↗
Figure 2
Figure 2. Figure 2: Many-Shot ICL performance of different models across multiple datasets. view at source ↗
Figure 3
Figure 3. Figure 3: The pipeline of AdapShot. ple, Llama-3.2-3B achieves only 23% accuracy on TriviaQA even with 1024 examples, indicating this task is extremely challenging for 3B-scale mod￾els. However, on another knowledge-intensive task, OpenBookQA, the same model achieves approx￾imately 65% performance with 64-256 examples. Furthermore, the "optimal number of examples" varies dramatically across models: Qwen2.5-7B requir… view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study on Position Decoupling and view at source ↗
Figure 5
Figure 5. Figure 5: Runtime comparison between AdapShot with view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of AdapShot’s dynamic shot view at source ↗
Figure 7
Figure 7. Figure 7: Scalability analysis of AdapShot. Scaling with LLM Parameters: We evaluated scalability on Qwen2.5-14B and 32B using the CoLA dataset. As shown in view at source ↗
read the original abstract

Many-Shot In-Context Learning (ICL) has emerged as a promising paradigm, leveraging extensive examples to unlock the reasoning potential of Large Language Models (LLMs). However, existing methods typically rely on a predetermined, fixed number of shots. This static approach often fails to adapt to the varying difficulty of different queries, leading to either insufficient context or interference from noise. Furthermore, the prohibitive computational and memory costs of long contexts severely limit Many-Shot's feasibility. To address the above limitations, we propose AdapShot, which dynamically optimizes shot counts and leverages KV cache reuse for efficient inference. Specifically, we design a probe-based evaluation mechanism that utilizes output entropy to determine the optimal number of shots. To bypass the redundant prefilling computation during both the probing and inference phases, we incorporate a semantics-aware KV cache reuse strategy. Within this reuse strategy, to address positional encoding incompatibilities, we introduce a decoupling and re-encoding method that enables the flexible reordering of cached key-value pairs. Extensive experiments demonstrate that AdapShot achieves an average performance gain of around 10% and a 4.64x speedup compared to state-of-the-art DBSA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes AdapShot for adaptive many-shot in-context learning in LLMs. It uses a probe-based mechanism that measures output entropy on a short run to dynamically select the per-query optimal shot count, combined with a semantic-aware KV cache reuse strategy. To handle positional encoding incompatibilities when reordering cached KV pairs, it introduces a decoupling step followed by re-encoding. The central empirical claim is an average ~10% performance improvement and 4.64x speedup relative to the state-of-the-art DBSA baseline.

Significance. If the reported gains are robust, the work would be a practically significant engineering contribution to making many-shot ICL feasible at scale by reducing both compute waste on easy queries and the cost of long-context prefilling. The combination of entropy-driven adaptation and KV-cache reordering addresses two real deployment bottlenecks. The manuscript receives credit for focusing on an empirical, reproducible-style engineering solution rather than purely theoretical claims.

major comments (3)
  1. [§3.2] §3.2 (Probe-based shot selection): The central performance claim rests on the assumption that output entropy from a short probe run is a sufficient and unbiased signal for the globally optimal shot count. No oracle comparison, failure-case analysis, or statistical test is provided showing that the entropy threshold reliably selects the shot count that would have been chosen by exhaustive search; if the probe is noisy or local, the reported 10% average gain cannot be attributed to the method.
  2. [§3.3] §3.3 (Decoupling and re-encoding): The KV-cache reuse strategy claims to preserve model behavior after reordering via decoupling plus re-encoding. No ablation isolating the re-encoding step, no verification that attention scores and positional encodings remain equivalent post-reordering, and no measurement of accuracy degradation on the same queries are reported. This directly undermines attribution of both the accuracy gain and the 4.64x speedup to the proposed technique.
  3. [§4] §4 (Experiments): The experimental section reports aggregate gains versus DBSA but supplies no protocol details (random seeds, number of runs, exact prompt templates, dataset splits), no statistical significance tests, no full baseline list with hyper-parameters, and no ablation tables on the entropy threshold or re-encoding component. These omissions make the central claims unverifiable from the presented data.
minor comments (2)
  1. [Abstract] The acronym DBSA is used without an initial expansion or citation; a reference or definition should be added on first use.
  2. [§3.3] Figure captions for the KV-cache diagrams could more explicitly label the decoupling and re-encoding operations to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where additional analysis or details are needed to strengthen the claims, we will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Probe-based shot selection): The central performance claim rests on the assumption that output entropy from a short probe run is a sufficient and unbiased signal for the globally optimal shot count. No oracle comparison, failure-case analysis, or statistical test is provided showing that the entropy threshold reliably selects the shot count that would have been chosen by exhaustive search; if the probe is noisy or local, the reported 10% average gain cannot be attributed to the method.

    Authors: We agree that direct validation of the probe's reliability against an oracle would strengthen attribution of the gains. The manuscript shows consistent outperformance over fixed-shot and DBSA baselines, with the entropy threshold chosen via validation-set tuning. In the revision we will add an oracle comparison on a query subset (exhaustive search for per-query optimal shots vs. probe selection), failure-case analysis, and correlation statistics between probe entropy and accuracy lift. These will appear in an expanded §3.2 and new Appendix C. revision: yes

  2. Referee: [§3.3] §3.3 (Decoupling and re-encoding): The KV-cache reuse strategy claims to preserve model behavior after reordering via decoupling plus re-encoding. No ablation isolating the re-encoding step, no verification that attention scores and positional encodings remain equivalent post-reordering, and no measurement of accuracy degradation on the same queries are reported. This directly undermines attribution of both the accuracy gain and the 4.64x speedup to the proposed technique.

    Authors: We acknowledge the need for explicit verification. The decoupling step removes positional information before semantic matching, and re-encoding restores compatibility. In the revised manuscript we will add an ablation isolating the re-encoding component, report attention-score distribution comparisons before/after reordering on representative queries, and measure accuracy on the same queries with and without the full reuse pipeline. Results will be placed in §3.3 and a new experimental table. revision: yes

  3. Referee: [§4] §4 (Experiments): The experimental section reports aggregate gains versus DBSA but supplies no protocol details (random seeds, number of runs, exact prompt templates, dataset splits), no statistical significance tests, no full baseline list with hyper-parameters, and no ablation tables on the entropy threshold or re-encoding component. These omissions make the central claims unverifiable from the presented data.

    Authors: We apologize for the missing protocol details. Experiments used 5 random seeds, standard dataset splits and default prompt templates from the source papers (MMLU, GSM8K, BBH, etc.). We will expand §4 with a dedicated Experimental Setup subsection containing all protocol information, baseline hyper-parameters, paired t-test results with p-values, and full ablation tables for the entropy threshold and re-encoding component. This will render the claims fully reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering contribution with independent experimental validation

full rationale

The paper presents AdapShot as an empirical system for adaptive many-shot ICL using an entropy-based probe for shot selection and a KV-cache reuse mechanism with decoupling/re-encoding. No derivation chain, equations, or first-principles results are claimed; performance gains (~10%) and speedups (4.64x) are reported solely from experiments against baselines like DBSA. The method does not define any quantity in terms of itself, rename fitted parameters as predictions, or rely on self-citations for load-bearing uniqueness theorems. The approach is self-contained as an engineering artifact whose claims rest on external benchmarks rather than internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested domain assumption that entropy is a reliable proxy for optimal shot count and that KV reordering preserves model behavior; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Output entropy computed from a short probe run correlates with the number of shots that maximizes final-task accuracy.
    Invoked to justify dynamic shot selection without exhaustive search.
  • domain assumption Decoupling positional encodings and re-applying them to reordered KV pairs leaves the model's attention computation unchanged.
    Required for the cache-reuse strategy to be lossless.

pith-pipeline@v0.9.0 · 5529 in / 1306 out tokens · 51286 ms · 2026-05-07T16:34:19.320364+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.