AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse
Pith reviewed 2026-05-07 16:34 UTC · model grok-4.3
The pith
AdapShot selects a per-query number of in-context examples by measuring output entropy in a probe run, and reuses the KV cache with reordering to make many-shot learning efficient.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdapShot dynamically optimizes shot counts using a probe-based evaluation with output entropy and employs a semantics-aware KV cache reuse strategy with decoupling and re-encoding to bypass redundant prefilling, achieving better performance and efficiency in many-shot in-context learning.
What carries the argument
Probe-based entropy evaluation for choosing shot count, together with decoupling and re-encoding to reorder cached key-value pairs for semantic-aware reuse.
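As an illustration of the probe idea (not the paper's implementation: the candidate shot grid, the threshold `tau`, and the `probe_fn` interface are all assumptions; the text here only establishes that a short probe run's output entropy drives the shot-count choice, and the referee exchange below mentions a validation-tuned entropy threshold, which this sketch assumes):

```python
import math

def output_entropy(step_distributions):
    """Mean Shannon entropy (nats) over a probe run's per-step output
    distributions; each distribution is a sequence of probabilities."""
    total = 0.0
    for dist in step_distributions:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(step_distributions)

def select_shot_count(probe_fn, candidate_shots=(4, 16, 64, 256), tau=1.0):
    """Return the smallest candidate shot count whose probe run is
    confident enough (entropy below the tuned threshold tau).

    probe_fn(n) is a hypothetical hook that runs a short generation with
    n shots in context and returns its per-step output distributions.
    Falls back to the largest candidate if no probe clears the bar.
    """
    for n in candidate_shots:
        if output_entropy(probe_fn(n)) < tau:
            return n
    return candidate_shots[-1]
```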
If this is right
- Queries receive different numbers of shots according to their measured entropy rather than a single fixed value.
- Prefilling work is skipped for both probes and final inference by reusing and reordering existing KV pairs.
- Positional encoding incompatibilities are handled so that reordered cache entries remain compatible with the model.
- Overall inference becomes faster while average accuracy rises compared with static or prior adaptive baselines.
Where Pith is reading between the lines
- The entropy probe could serve as a lightweight signal for adapting other ICL choices such as example ordering or format.
- Similar decoupling techniques might reduce recomputation in other long-context tasks like retrieval-augmented generation.
- If the reordering preserves accuracy across model families, the method points toward query-adaptive prompting that requires no per-model tuning.
Load-bearing premise
Output entropy from a short probe run is a sufficient and unbiased signal for selecting the globally optimal shot count, and the decoupling-plus-re-encoding step for KV cache reordering introduces no accuracy degradation.
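The re-encoding half of this premise is concrete enough to sketch. For rotary position embeddings (RoPE — an assumption, though plausible given reference [23] below), "decoupling" a cached key amounts to rotating it back by its original position's angles, and "re-encoding" rotates it forward at the new position; since per-pair rotations compose additively, this collapses to a single rotation by the position delta. A minimal NumPy sketch under that RoPE assumption, with a round-trip check:

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    # One rotation angle per channel pair, standard RoPE frequencies.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq

def rotate(x, theta):
    # Rotate consecutive channel pairs of x by the given angles.
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def re_encode(cached_key, old_pos, new_pos, base=10000.0):
    """Decouple (rotate back by the old position's angles), then
    re-encode (rotate forward at the new position): one delta rotation."""
    dim = cached_key.shape[-1]
    delta = rope_angles(new_pos, dim, base) - rope_angles(old_pos, dim, base)
    return rotate(cached_key, delta)

# Round-trip check: re-encoding a cached key matches encoding the raw
# key directly at its new position.
rng = np.random.default_rng(0)
raw_key = rng.standard_normal(64)
cached = rotate(raw_key, rope_angles(120, 64))     # encoded at position 120
moved = re_encode(cached, old_pos=120, new_pos=8)  # reordered to position 8
assert np.allclose(moved, rotate(raw_key, rope_angles(8, 64)))
```

Note that in standard RoPE only queries and keys are rotated, so cached value vectors would need no re-encoding; the keys are the load-bearing case.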
What would settle it
Measure whether accuracy using the entropy-selected shot count matches or exceeds the accuracy obtained by exhaustively testing several shot counts on the same query, or compare final accuracy when the KV cache is reordered versus when the entire context is recomputed from scratch.
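The first test is mechanical to run once per-query scoring exists. A hypothetical harness (`accuracy_at`, `probe_select`, and the candidate grid are placeholders, not the paper's protocol):

```python
def oracle_gap(queries, accuracy_at, probe_select,
               candidate_shots=(4, 16, 64, 256)):
    """Mean accuracy gap between the exhaustive-search oracle and the
    probe's choice. accuracy_at(q, n) scores query q with n shots;
    probe_select(q) returns the entropy-selected shot count."""
    gaps = []
    for q in queries:
        oracle = max(accuracy_at(q, n) for n in candidate_shots)
        probed = accuracy_at(q, probe_select(q))
        gaps.append(oracle - probed)
    return sum(gaps) / len(gaps)
```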
Original abstract
Many-Shot In-Context Learning (ICL) has emerged as a promising paradigm, leveraging extensive examples to unlock the reasoning potential of Large Language Models (LLMs). However, existing methods typically rely on a predetermined, fixed number of shots. This static approach often fails to adapt to the varying difficulty of different queries, leading to either insufficient context or interference from noise. Furthermore, the prohibitive computational and memory costs of long contexts severely limit Many-Shot's feasibility. To address the above limitations, we propose AdapShot, which dynamically optimizes shot counts and leverages KV cache reuse for efficient inference. Specifically, we design a probe-based evaluation mechanism that utilizes output entropy to determine the optimal number of shots. To bypass the redundant prefilling computation during both the probing and inference phases, we incorporate a semantics-aware KV cache reuse strategy. Within this reuse strategy, to address positional encoding incompatibilities, we introduce a decoupling and re-encoding method that enables the flexible reordering of cached key-value pairs. Extensive experiments demonstrate that AdapShot achieves an average performance gain of around 10% and a 4.64x speedup compared to state-of-the-art DBSA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdapShot for adaptive many-shot in-context learning in LLMs. It uses a probe-based mechanism that measures output entropy on a short run to dynamically select the per-query optimal shot count, combined with a semantic-aware KV cache reuse strategy. To handle positional encoding incompatibilities when reordering cached KV pairs, it introduces a decoupling step followed by re-encoding. The central empirical claim is an average ~10% performance improvement and 4.64x speedup relative to the state-of-the-art DBSA baseline.
Significance. If the reported gains are robust, the work would be a practically significant engineering contribution to making many-shot ICL feasible at scale by reducing both compute waste on easy queries and the cost of long-context prefilling. The combination of entropy-driven adaptation and KV-cache reordering addresses two real deployment bottlenecks. The manuscript receives credit for focusing on an empirical, reproducibility-oriented engineering solution rather than purely theoretical claims.
major comments (3)
- [§3.2] §3.2 (Probe-based shot selection): The central performance claim rests on the assumption that output entropy from a short probe run is a sufficient and unbiased signal for the globally optimal shot count. No oracle comparison, failure-case analysis, or statistical test is provided showing that the entropy threshold reliably selects the shot count that would have been chosen by exhaustive search; if the probe is noisy or local, the reported 10% average gain cannot be attributed to the method.
- [§3.3] §3.3 (Decoupling and re-encoding): The KV-cache reuse strategy claims to preserve model behavior after reordering via decoupling plus re-encoding. No ablation isolating the re-encoding step, no verification that attention scores and positional encodings remain equivalent post-reordering, and no measurement of accuracy degradation on the same queries are reported. This directly undermines attribution of both the accuracy gain and the 4.64x speedup to the proposed technique.
- [§4] §4 (Experiments): The experimental section reports aggregate gains versus DBSA but supplies no protocol details (random seeds, number of runs, exact prompt templates, dataset splits), no statistical significance tests, no full baseline list with hyper-parameters, and no ablation tables on the entropy threshold or re-encoding component. These omissions make the central claims unverifiable from the presented data.
minor comments (2)
- [Abstract] The acronym DBSA is used without an initial expansion or citation; a reference or definition should be added on first use.
- [§3.3] Figure captions for the KV-cache diagrams could more explicitly label the decoupling and re-encoding operations to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where additional analysis or details are needed to strengthen the claims, we will revise the manuscript accordingly.
Point-by-point responses
Referee: [§3.2] §3.2 (Probe-based shot selection): The central performance claim rests on the assumption that output entropy from a short probe run is a sufficient and unbiased signal for the globally optimal shot count. No oracle comparison, failure-case analysis, or statistical test is provided showing that the entropy threshold reliably selects the shot count that would have been chosen by exhaustive search; if the probe is noisy or local, the reported 10% average gain cannot be attributed to the method.
Authors: We agree that direct validation of the probe's reliability against an oracle would strengthen attribution of the gains. The manuscript shows consistent outperformance over fixed-shot and DBSA baselines, with the entropy threshold chosen via validation-set tuning. In the revision we will add an oracle comparison on a query subset (exhaustive search for per-query optimal shots vs. probe selection), failure-case analysis, and correlation statistics between probe entropy and accuracy lift. These will appear in an expanded §3.2 and new Appendix C. revision: yes
Referee: [§3.3] §3.3 (Decoupling and re-encoding): The KV-cache reuse strategy claims to preserve model behavior after reordering via decoupling plus re-encoding. No ablation isolating the re-encoding step, no verification that attention scores and positional encodings remain equivalent post-reordering, and no measurement of accuracy degradation on the same queries are reported. This directly undermines attribution of both the accuracy gain and the 4.64x speedup to the proposed technique.
Authors: We acknowledge the need for explicit verification. The decoupling step removes positional information before semantic matching, and re-encoding restores compatibility. In the revised manuscript we will add an ablation isolating the re-encoding component, report attention-score distribution comparisons before/after reordering on representative queries, and measure accuracy on the same queries with and without the full reuse pipeline. Results will be placed in §3.3 and a new experimental table. revision: yes
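One cheap version of the promised verification: score the same query twice, once with the context prefilled from scratch and once through the reuse pipeline, and compare the resulting next-token distributions. A sketch of the comparison metric only; producing the two logit vectors is model-specific and not shown here:

```python
import numpy as np

def reuse_divergence(logits_recompute, logits_reuse):
    """Compare next-token distributions from a from-scratch prefill vs.
    the reordered-cache path on the same query. Inputs are 1-D logit
    vectors over the vocabulary (obtained however the model exposes them)."""
    a = np.asarray(logits_recompute, dtype=float)
    b = np.asarray(logits_reuse, dtype=float)
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p, q = softmax(a), softmax(b)
    # Small epsilon keeps the KL term finite if a probability underflows.
    kl = float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
    return {"kl_divergence": kl,
            "max_abs_logit_diff": float(np.max(np.abs(a - b)))}
```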
Referee: [§4] §4 (Experiments): The experimental section reports aggregate gains versus DBSA but supplies no protocol details (random seeds, number of runs, exact prompt templates, dataset splits), no statistical significance tests, no full baseline list with hyper-parameters, and no ablation tables on the entropy threshold or re-encoding component. These omissions make the central claims unverifiable from the presented data.
Authors: We apologize for the missing protocol details. Experiments used 5 random seeds, standard dataset splits and default prompt templates from the source papers (MMLU, GSM8K, BBH, etc.). We will expand §4 with a dedicated Experimental Setup subsection containing all protocol information, baseline hyper-parameters, paired t-test results with p-values, and full ablation tables for the entropy threshold and re-encoding component. This will render the claims fully reproducible. revision: yes
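The promised paired tests are a one-liner once per-seed accuracies exist; a minimal SciPy sketch with placeholder numbers (not the paper's results):

```python
from scipy.stats import ttest_rel

# Per-seed accuracies for AdapShot and DBSA on identical splits
# (illustrative values only).
adapshot = [0.71, 0.69, 0.72, 0.70, 0.73]
dbsa = [0.62, 0.60, 0.64, 0.61, 0.63]

t_stat, p_value = ttest_rel(adapshot, dbsa)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```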
Circularity Check
No circularity: empirical engineering contribution with independent experimental validation
full rationale
The paper presents AdapShot as an empirical system for adaptive many-shot ICL using an entropy-based probe for shot selection and a KV-cache reuse mechanism with decoupling/re-encoding. No derivation chain, equations, or first-principles results are claimed; performance gains (~10%) and speedups (4.64x) are reported solely from experiments against baselines like DBSA. The method does not define any quantity in terms of itself, rename fitted parameters as predictions, or rely on self-citations for load-bearing uniqueness theorems. The approach is self-contained as an engineering artifact whose claims rest on external benchmarks rather than internal reductions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Output entropy computed from a short probe run correlates with the number of shots that maximizes final-task accuracy.
- domain assumption: Decoupling positional encodings and re-applying them to reordered KV pairs leaves the model's attention computation unchanged.
Reference graph
Works this paper leans on
- [1] Shantanu Acharya, Fei Jia, and Boris Ginsburg. 2024. Star Attention: Efficient LLM inference over long sequences. In Forty-second International Conference on Machine Learning.
- [2] Rishabh Agarwal, Avi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, and 1 others. 2024. Many-shot in-context learning. Advances in Neural Information Processing Systems, 37:76930--76966.
- [3] Amanda Bertsch, Maor Ivgi, Emily Xiao, Uri Alon, Jonathan Berant, Matthew R Gormley, and Graham Neubig. 2025. In-context learning with long-context models: An in-depth exploration. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
- [4]
- [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877--1901.
- [6]
- [7]
- [8] Zihan Chen, Song Wang, Zhen Tan, Jundong Li, and Cong Shen. 2025b. Maple: Many-shot adaptive pseudo-labeling for in-context learning. In Forty-second International Conference on Machine Learning.
- [9] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- [10] Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. 2025. Towards compute-optimal many-shot in-context learning. In Second Conference on Language Modeling.
- [11] Zhengyao Gu, Henry Peng Zou, Aiwei Liu, Yankai Chen, Weizhi Zhang, and Philip S Yu. 2025. Scaling laws for many-shot in-context learning with self-generated annotations. In ICML 2025 Workshop on Long-Context Foundation Models.
- [12] Yinhan He, Wendy Zheng, Song Wang, Zaiyi Zheng, Yushun Dong, Yaochen Zhu, and Jundong Li. 2024. Hierarchical demonstration order optimization for many-shot in-context learning.
- [13]
- [14]
- [15] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947--22970.
- [16] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. 2024. MiniCache: KV cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems, 37:139997--140031.
- [17]
- [18] Saeed Moayedpour, Alejandro Corrochano-Navarro, Faryad Sahneh, Alexander Koetter, Jiří Vymětal, Lorenzo Kogler Anele, Pablo Mas, Yasser Jangjoo, Sizhen Li, Michael Bailey, and 1 others. 2024. Many-shot in-context learning for molecular inverse design. In ICML 2024 AI for Science Workshop.
- [19] Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. Parallel context windows for large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6383--6402.
- [20] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
- [22] Xingchen Wan, Han Zhou, Ruoxi Sun, and Sercan O Arik. 2025. From few to many: Self-improving many-shot reasoners through iterative optimization and generation. In The Thirteenth International Conference on Learning Representations.
- [23] Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, and Tianyu Pang. 2024. When precision meets position: BFloat16 breaks down RoPE in long-context training. Transactions on Machine Learning Research.
- [24] Emily Xiao, Chin-Jou Li, Yilin Zhang, Graham Neubig, and Amanda Bertsch. 2025. Efficient many-shot in-context learning with dynamic block-sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31946--31958, Vienna, Austria. https://doi.org/10.18653/v1/2025.acl-long.1542
- [25] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations.
- [26] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480--1489.
- [27] Qingyu Yin, Xuzheng He, Chak Tou Leong, Fan Wang, Yanzhao Yan, Xiaoyu Shen, and Qiang Zhang. 2024. Deeper insights without updates: The power of in-context learning over fine-tuning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4138--4151.
- [28] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and 1 others. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283--17297.
- [29] Jiaquan Zhang, Qigan Sun, Chaoning Zhang, Xudong Wang, Zhenzhen Huang, Yitian Zhou, Pengcheng Zheng, Chi-lok Andy Tai, Sung-Ho Bae, Zeyu Ma, and 1 others. 2026a. TDA-RC: Task-driven alignment for knowledge-based reasoning chains in large language models. arXiv preprint arXiv:2604.04942.
- [30]
- [31]
- [32] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, and 1 others. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661--34710.
- [33]
- [34]
- [35] Kaijian Zou, Muhammad Khalifa, and Lu Wang. 2025. On many-shot in-context learning for long-context evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25605--25639.