Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Pith reviewed 2026-05-14 19:12 UTC · model grok-4.3
The pith
Many-shot chain-of-thought in-context learning behaves as test-time learning when demonstrations are ordered for smooth conceptual progression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across both reasoning and non-reasoning LLMs and tasks, many-shot CoT-ICL exhibits a setting-dependent scaling effect, fails under similarity-based retrieval because semantic similarity poorly predicts CoT compatibility, and displays growing performance variance with demonstration order. Viewing the setup as in-context test-time learning rather than scaled pattern matching yields two principles: demonstrations should be easy for the target model to understand and should be ordered to support smooth conceptual progression. Guided by these principles, Curvilinear Demonstration Selection produces consistent gains, reframing the long context window as a structured curriculum.
What carries the argument
Curvilinear Demonstration Selection (CDS), an ordering method that arranges demonstrations to follow a smooth conceptual progression so the model can perform in-context test-time learning.
If this is right
- Reasoning-oriented models benefit from many more CoT demonstrations once ordering respects conceptual progression.
- Similarity-based retrieval must be replaced by procedural-compatibility measures for reasoning tasks.
- Performance variance grows with demonstration count unless order supports smooth progression.
- Long context windows function as curricula rather than simple retrieval buffers.
Where Pith is reading between the lines
- The same ordering principle could be applied to non-reasoning tasks by first identifying what the model finds easy to parse.
- Models might internally assess demonstration difficulty and reorder prompts dynamically at inference time.
- Curriculum design for in-context learning could be tested on tasks beyond geometry by measuring step-by-step mastery.
- If test-time learning is the mechanism, training objectives that reward smooth progression in synthetic data might amplify the effect.
Load-bearing premise
The observed gains arise because ordered demonstrations enable the model to perform test-time learning rather than because of raw prompt length or model-specific artifacts.
What would settle it
Randomly ordering the same 64 demonstrations or replacing the curvilinear order with any non-progressive sequence should produce comparable accuracy gains if the test-time learning account is incorrect.
Figures
read the original abstract
While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In this work, we study many-shot ICL on reasoning tasks, with a particular focus on many-shot chain-of-thought in-context learning (CoT-ICL). Across non-reasoning and reasoning tasks and across non-reasoning and reasoning-oriented LLMs, we identify several distinctive properties of many-shot CoT-ICL. We further interpret these findings by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on a math task with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning tasks. It reports three main empirical patterns: (i) a setting-dependent scaling effect in which adding more CoT demonstrations is unstable for non-reasoning LLMs but beneficial for reasoning-oriented models; (ii) failure of semantic similarity retrieval on reasoning tasks because it does not capture procedural compatibility; and (iii) an order-scaling effect in which performance variance grows with the number of demonstrations. The authors interpret these behaviors as evidence that many-shot CoT-ICL functions as in-context test-time learning rather than scaled pattern matching, propose two guiding principles (demonstrations should be easy for the model to understand and should be ordered for smooth conceptual progression), and introduce Curvilinear Demonstration Selection (CDS) as a concrete ordering heuristic that yields up to a 5.42 percentage-point gain on geometry tasks with 64 demonstrations.
Significance. If the central empirical patterns and the attribution of gains to ordering hold, the work supplies a useful conceptual reframing of long-context ICL as structured curriculum-style learning and a practical, low-overhead method (CDS) that improves reasoning performance. The cross-model and cross-task consistency of the reported scaling behaviors is a positive feature that could inform future prompt-engineering practice.
major comments (3)
- [CDS description and associated experiments] The experiments do not appear to ablate ordering while holding the exact demonstration set fixed. Consequently it remains unclear whether the reported 5.42 pp gain on geometry with 64 shots is produced by the curvilinear sequence itself or by the upstream selection of high-quality or procedurally compatible examples. This distinction is load-bearing for the claim that principle (ii) (smooth conceptual progression) explains the order-scaling effect and for the test-time-learning interpretation.
- [Experimental results and methods] The results sections provide no variance measures (standard deviations or confidence intervals across random seeds or runs), no exact operationalization of the CDS curvilinear ordering procedure, and no explicit controls for prompt-length or token-budget confounds when scaling from few-shot to 64-shot regimes. These omissions weaken the support for both the scaling-effect claims and the performance gains attributed to CDS.
- [Discussion and interpretation] The interpretation of many-shot CoT-ICL as in-context test-time learning is offered as a post-hoc reframing of the observed behaviors. No direct diagnostic experiments (e.g., incremental probing of concept acquisition or comparison against non-ordered but equally informative demonstration sets) are reported that would distinguish this account from alternative explanations such as improved coverage or reduced example interference.
minor comments (2)
- [Methods] Define 'reasoning-oriented LLMs' versus 'non-reasoning LLMs' more explicitly in the methods section, including the criteria used for classification.
- [Figures and tables] Add error bars or confidence intervals to all performance plots and tables that report scaling curves.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which has helped us strengthen the empirical rigor and clarity of the manuscript. We have revised the paper to address the concerns about ablations, variance reporting, method operationalization, and controls. Below we respond point-by-point to each major comment.
read point-by-point responses
-
Referee: [CDS description and associated experiments] The experiments do not appear to ablate ordering while holding the exact demonstration set fixed. Consequently it remains unclear whether the reported 5.42 pp gain on geometry with 64 shots is produced by the curvilinear sequence itself or by the upstream selection of high-quality or procedurally compatible examples. This distinction is load-bearing for the claim that principle (ii) (smooth conceptual progression) explains the order-scaling effect and for the test-time-learning interpretation.
Authors: We agree that isolating the contribution of ordering requires holding the demonstration set fixed. In the revised manuscript we have added a controlled ablation on the geometry task with 64 shots: we fix the exact set of demonstrations selected by CDS and compare performance under (a) the original curvilinear order versus (b) a random permutation of the same set. The curvilinear order yields a statistically significant improvement over random order on the fixed set, supporting that the ordering itself drives part of the gain and bolstering the smooth-progression principle. We have also clarified in Section 4.2 that CDS first selects candidates by procedural compatibility heuristics and then orders them; the new ablation separates these stages. revision: yes
-
Referee: [Experimental results and methods] The results sections provide no variance measures (standard deviations or confidence intervals across random seeds or runs), no exact operationalization of the CDS curvilinear ordering procedure, and no explicit controls for prompt-length or token-budget confounds when scaling from few-shot to 64-shot regimes. These omissions weaken the support for both the scaling-effect claims and the performance gains attributed to CDS.
Authors: We acknowledge these omissions. The revised version now reports standard deviations across five independent runs (different random seeds for ordering and model sampling) for all scaling curves and CDS results. We have added a precise algorithmic description of CDS, including pseudocode for the curvilinear traversal, in the methods section. To address token-budget confounds, we include a controlled experiment that truncates all prompts to the same maximum token length when scaling from 4 to 64 shots; the reported scaling trends and CDS gains remain consistent under this control. revision: yes
-
Referee: [Discussion and interpretation] The interpretation of many-shot CoT-ICL as in-context test-time learning is offered as a post-hoc reframing of the observed behaviors. No direct diagnostic experiments (e.g., incremental probing of concept acquisition or comparison against non-ordered but equally informative demonstration sets) are reported that would distinguish this account from alternative explanations such as improved coverage or reduced example interference.
Authors: The test-time-learning framing is indeed interpretive and derived from the combination of the three observed patterns rather than from dedicated diagnostic probes. We have expanded the discussion to explicitly contrast this account with alternatives (improved coverage, reduced interference) and to acknowledge the absence of incremental probing experiments as a limitation. The new fixed-set ordering ablation helps differentiate ordering effects from pure coverage, but we agree that stronger causal evidence would require additional experiments (e.g., step-wise concept probes) that are beyond the scope of the current revision. revision: partial
- Direct diagnostic experiments (incremental probing of concept acquisition) to causally distinguish the test-time learning interpretation from alternatives such as coverage or interference.
Circularity Check
No significant circularity; claims rest on empirical observations and heuristic proposal
full rationale
The paper reports empirical scaling effects for many-shot CoT-ICL, interprets them as in-context test-time learning, states two principles, and introduces CDS as a simple ordering heuristic guided by those principles. No derivation reduces by construction to its inputs: there are no equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz smuggled via prior work. Performance gains are presented as experimental results on specific tasks and models rather than tautological outcomes of the framing itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs perform in-context test-time learning that improves when demonstrations are ordered for smooth conceptual progression
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.