Many-Shot In-Context Learning

Aleksandra Faust; Ankesh Anand; Avi Singh; Azade Nova; Bernd Bohnet; Biao Zhang; Eric Chu; Feryal Behbahani; Hugo Larochelle; John D. Co-Reyes

arxiv: 2404.11018 · v3 · pith:UG7AKLU6new · submitted 2024-04-17 · 💻 cs.LG · cs.AI · cs.CL

Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle

classification 💻 cs.LG cs.AI cs.CL

keywords many-shot, learning, examples, few-shot, regime, reinforced, unsupervised, context

Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
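The three prompting setups named in the abstract can be illustrated with a minimal sketch. The abstract does not specify prompt templates or how model-generated rationales are selected, so everything here is an assumption made for illustration: the "Q:/A:" template, the helper names, and the rule that keeps only rationales whose final answer matches a known answer. The `generate` callable stands in for any LLM call and is stubbed out below.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, solution text)


def many_shot_prompt(examples: List[Example], query: str) -> str:
    """Concatenate solved examples, then the unanswered query.

    The "Q:/A:" layout is a hypothetical template, not the paper's.
    """
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    return "\n\n".join(shots + [f"Q: {query}\nA:"])


def unsupervised_prompt(questions: List[str], query: str) -> str:
    """Unsupervised ICL: show only questions, without rationales or answers."""
    shots = [f"Q: {q}" for q in questions]
    return "\n\n".join(shots + [f"Q: {query}\nA:"])


def reinforced_examples(
    problems: List[Example],
    generate: Callable[[str], str],
    extract_answer: Callable[[str], str],
) -> List[Example]:
    """Reinforced ICL: replace human solutions with model-generated rationales.

    Assumption: a rationale is kept only if its final answer matches the
    known answer for that problem.
    """
    kept = []
    for question, gold in problems:
        rationale = generate(question)
        if extract_answer(rationale) == gold:
            kept.append((question, rationale))
    return kept


# Stub "model" so the sketch runs without any API access (hypothetical data).
canned = {
    "2+2": "2 plus 2 is 4. Answer: 4",
    "3+3": "3 plus 3 is 5. Answer: 5",  # wrong on purpose, gets filtered out
}


def stub_generate(question: str) -> str:
    return canned[question]


def final_answer(rationale: str) -> str:
    return rationale.rsplit("Answer:", 1)[-1].strip()


kept = reinforced_examples([("2+2", "4"), ("3+3", "6")], stub_generate, final_answer)
prompt = many_shot_prompt(kept, "5+5")
```

In the many-shot regime the example list would hold hundreds or thousands of entries rather than one or two, so the prompt grows with the number of shots.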

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RULER: What's the Real Context Size of Your Long-Context Language Models?
cs.CL 2024-04 accept novelty 8.0
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
cs.CL 2025-02 unverdicted novelty 7.0
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation...

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
cs.CL 2024-06 accept novelty 7.0
This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

When Youth Enter the Algorithmic Wild: Discovering and Understanding Potentially Harmful Teen Videos on Douyin and Kwai
cs.CR 2026-05 unverdicted novelty 6.0
PHTV-Scout measures 6.11% prevalence of potentially harmful teen videos on Douyin and Kwai (53.2% child sexual exploitation imagery), shows Youth Mode blocks all such content but is used by only 30-41% of teens, and a...

AMEL: Accumulated Message Effects on LLM Judgments
cs.AI 2026-05 unverdicted novelty 6.0
LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.

AMEL: Accumulated Message Effects on LLM Judgments
cs.AI 2026-05 conditional novelty 6.0
LLMs exhibit an accumulated message effect where conversation history saturated with positive or negative evaluations biases subsequent judgments, with larger shifts on uncertain items, a negativity asymmetry, and no ...

"AI Psychosis" in Context: How Conversation History Shapes LLM Responses to Delusional Beliefs
cs.HC 2026-04 unverdicted novelty 6.0
Longer conversation histories cause some LLMs to reinforce delusional beliefs more while others activate stronger safety responses using the established context.

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
cs.AI 2026-04 unverdicted novelty 6.0
Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
cs.CL 2024-12 accept novelty 6.0
LongBench v2 benchmark shows current LLMs underperform humans on deep long-context reasoning tasks, but extended inference-time reasoning enables surpassing the human baseline.

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
cs.LG 2026-06 unverdicted novelty 5.0
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
cs.CV 2026-06 unverdicted novelty 5.0
TASM proposes a task-aware structured memory framework using task-vector compression, bipartite token merging, and a Core Memory plus Latent Bank hierarchy to enable efficient dynamic multi-modal in-context learning.

What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems
cs.AI 2026-06 unverdicted novelty 5.0
Introduces PACT protocol that projects agent outputs into action-state records, yielding comparable or better task performance with substantially fewer tokens in multi-agent LLM systems and production harnesses.

Structuring Human-AI Productive Interdependence by Strategic Level of Automation Selection for Qualitative Inquiry
cs.HC 2026-05 unverdicted novelty 5.0
Proposes a formal framework based on Interdependence Theory to select Levels of Automation for qualitative analysis stages by assessing task risk and validation cost, shown in a case study with three design principles.

LLMs with in-context learning for Algorithmic Theoretical Physics
cs.LG 2026-05 unverdicted novelty 5.0
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC 2026-04 unverdicted novelty 5.0
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems
cs.CL 2025-09 unverdicted novelty 5.0
RECAP is an inference-time framework using cognitive appraisal theory to enhance emotional alignment and transparency in medical dialogue systems across model scales.

CLaaS: Continual learning as a service for sample efficient online learning
cs.LG 2026-06 unverdicted novelty 4.0
CLaaS enables sample-efficient online continual learning for agents via replay-buffered parametric updates, outperforming in-context learning in forward transfer and retention on an adversarial task.

The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences
cs.CL 2025-09 unverdicted novelty 3.0
The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.