
arxiv: 2605.13511 · v2 · pith:A2CQIYS2new · submitted 2026-05-13 · 💻 cs.CL · cs.AI

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Pith reviewed 2026-05-14 19:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords many-shot in-context learning · chain-of-thought prompting · demonstration ordering · test-time learning · reasoning tasks · curvilinear selection · LLM scaling

The pith

Many-shot chain-of-thought in-context learning behaves as test-time learning when demonstrations are ordered for smooth conceptual progression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that many-shot CoT-ICL does not follow the same scaling rules as non-reasoning ICL. Adding more chain-of-thought demonstrations gives unstable results for non-reasoning models and tasks but helps reasoning-oriented models. Semantic-similarity retrieval also fails on reasoning tasks because it does not capture procedural compatibility. The authors interpret these patterns by treating the long prompt as a curriculum for in-context test-time learning rather than mere pattern matching. From this view they derive two principles for demonstration choice and introduce Curvilinear Demonstration Selection, an ordering technique that improves accuracy by up to 5.42 percentage points on geometry problems with 64 examples.

Core claim

Across both reasoning and non-reasoning LLMs and tasks, many-shot CoT-ICL exhibits a setting-dependent scaling effect, fails under similarity-based retrieval because semantic similarity poorly predicts CoT compatibility, and displays growing performance variance with demonstration order. Viewing the setup as in-context test-time learning rather than scaled pattern matching yields two principles: demonstrations should be easy for the target model to understand and should be ordered to support smooth conceptual progression. Guided by these principles, Curvilinear Demonstration Selection produces consistent gains, reframing the long context window as a structured curriculum.

What carries the argument

Curvilinear Demonstration Selection (CDS), an ordering method that arranges demonstrations to follow a smooth conceptual progression so the model can perform in-context test-time learning.
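
To make "smooth conceptual progression" concrete, here is a minimal sketch of one way such an ordering could be built. It is not the paper's CDS procedure, which this summary does not specify. It swaps in a simple greedy nearest-neighbor chain: start from the easiest demonstration and repeatedly append the unused demonstration closest to the last one placed. The feature vectors, the difficulty scores, and the names `smooth_order` and `path_length` are all hypothetical.

```python
import math

def smooth_order(features, difficulty):
    """Greedy chain (a stand-in, NOT the paper's CDS): begin at the easiest
    demonstration, then always step to the nearest unused one."""
    if len(features) != len(difficulty):
        raise ValueError("features and difficulty must have equal length")
    if not features:
        return []
    remaining = set(range(len(features)))
    # Ties are broken by index so the result is deterministic.
    current = min(remaining, key=lambda i: (difficulty[i], i))
    order = [current]
    remaining.remove(current)
    while remaining:
        nxt = min(
            remaining,
            key=lambda i: (math.dist(features[current], features[i]), i),
        )
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order

def path_length(features, order):
    """Total distance travelled between consecutive demonstrations."""
    return sum(
        math.dist(features[a], features[b]) for a, b in zip(order, order[1:])
    )

# Hypothetical 2-D features and difficulty scores for four demonstrations.
features = [(0.0, 0.0), (5.0, 5.0), (1.0, 0.0), (4.0, 5.0)]
difficulty = [0.1, 0.9, 0.3, 0.8]
order = smooth_order(features, difficulty)
print(order, round(path_length(features, order), 2))
```

A shorter total path is only a rough proxy for smoothness; whether it tracks what the model finds easy to follow is exactly what the paper's experiments would need to establish.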

If this is right

  • Reasoning-oriented models benefit from many more CoT demonstrations once ordering respects conceptual progression.
  • Similarity-based retrieval must be replaced by procedural-compatibility measures for reasoning tasks.
  • Performance variance grows with demonstration count unless order supports smooth progression.
  • Long context windows function as curricula rather than simple retrieval buffers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ordering principle could be applied to non-reasoning tasks by first identifying what the model finds easy to parse.
  • Models might internally assess demonstration difficulty and reorder prompts dynamically at inference time.
  • Curriculum design for in-context learning could be tested on tasks beyond geometry by measuring step-by-step mastery.
  • If test-time learning is the mechanism, training objectives that reward smooth progression in synthetic data might amplify the effect.

Load-bearing premise

The observed gains arise because ordered demonstrations enable the model to perform test-time learning rather than because of raw prompt length or model-specific artifacts.

What would settle it

Randomly ordering the same 64 demonstrations or replacing the curvilinear order with any non-progressive sequence should produce comparable accuracy gains if the test-time learning account is incorrect.
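
The check described above can be sketched as a fixed-set ablation: keep the demonstration set constant, score one candidate order, and compare it with several random permutations of the same set. This is a sketch under stated assumptions. `evaluate` is a hypothetical callable that returns an accuracy for an ordered list of demonstrations, and `toy_evaluate` below is a synthetic stand-in, not a model result.

```python
import random

def order_ablation(demos, candidate_order, evaluate, n_random=5, seed=0):
    """Compare one ordering against random permutations of the SAME set."""
    if sorted(candidate_order) != list(range(len(demos))):
        raise ValueError("candidate_order must be a permutation of the demos")
    rng = random.Random(seed)
    candidate_acc = evaluate([demos[i] for i in candidate_order])
    random_accs = []
    for _ in range(n_random):
        perm = list(range(len(demos)))
        rng.shuffle(perm)
        random_accs.append(evaluate([demos[i] for i in perm]))
    mean_random = sum(random_accs) / len(random_accs)
    return {
        "candidate": candidate_acc,
        "random_mean": mean_random,
        "gap": candidate_acc - mean_random,
        # Number of random orders the candidate strictly beats.
        "random_beaten": sum(acc < candidate_acc for acc in random_accs),
        "n_random": n_random,
    }

def toy_evaluate(ordered_demos):
    """Synthetic score: fraction of adjacent pairs in ascending order."""
    pairs = list(zip(ordered_demos, ordered_demos[1:]))
    return sum(a < b for a, b in pairs) / len(pairs)

demos = list(range(8))  # placeholder demonstrations
result = order_ablation(demos, list(range(8)), toy_evaluate)
print(result)
```

If the gap stays near zero for a real model, ordering is not doing the work; a consistent positive gap on a fixed set would point to the sequence itself.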

Figures

Figures reproduced from arXiv: 2605.13511 by Dit-Yan Yeung, Lemao Liu, Mo Yu, Tsz Ting Chung.

Figure 1: Reframing of CoT-ICL as in-context test-time learning.
    1. Introduction In-context learning (ICL) enables large language models (LLMs) to perform tasks by conditioning on a sequence of input-output demonstrations without updating their parameters (Min et al., 2022; Von Oswald et al., 2023). Research has focused on improving ICL through strategies like selecting effective demonstrations (Sorensen et al., 2…
Figure 3: Scaling disparity between model types on math reasoning tasks. Left: Llama 3.3 (non-reasoning LLM) shows negative gains. Right: QwQ (32B) and R1 (685B) (reasoning LLM) shows clear positive scaling.
    Model            geometry  number theory  DetectiveQA
    Qwen3-14B (en)   73.07     91.30          72.73
    Qwen3-14B (dis)  65.76     88.15          72.73
    Qwen3-8B (en)    67.01     84.63          69.48
    Qwen3-8B (dis)   62.63     79.81          66.88
Figure 2: Scaling disparity between task types. Performance (normalized accuracy) of non-reasoning LLMs on classification tasks (warm colors) versus reasoning tasks (cool colors). The x-axis represents normalized accuracy (i.e., $(x - \bar{x})/\sigma_x$ for accuracy $x$), while the y-axis indicates the number of in-context demonstrations.
    4. Properties of CoT-ICL 4.1. Scaling with Reasoning Tasks Prior work reports that many-shot ICL …
Figure 4: Positive scaling of reasoning LLMs. The Qwen3 family (reasoning LLMs) demonstrates consistent performance improvements with more demonstrations on math reasoning tasks. Left: Qwen3 (8B). Right: Qwen3 (14B).
    CoT-ICL for reasoning is unstable for non-reasoning LLMs and improves mainly for reasoning-optimized LLMs. For positive scaling effect, a common explanation for why many-shot ICL works is the retrieval h…
Figure 6: Standard deviation of performance across five random demonstration orders on classification tasks (warm colors) versus reasoning tasks (cool colors), where nt corresponds to number theory. Results shown for Qwen2.5 (14B) (non-reasoning) and Qwen3 (14B) (reasoning).
    et al., 2024). We quantify order sensitivity by sampling five random permutations of the same demonstration set and measuring the standard d…
Figure 7: Performance of two sets of self-generated in-context CoT, including the set filtered with only correct answer (cr) and the set filtered with only wrong answer (wr). crqwen14 is prompting the LLaMA model with the in-context CoT generated by Qwen 2.5 (14B). Left: Llama 3.1. Right: Qwen 2.5 (14B).
    … parameter sizes under the provision of dataset-provided CoT-ICL. A plausible explanation is that the reasoning within t…
Figure 8: Performance of the first set of self-generated in-context CoT. firstqwen3(14b) is prompting the Qwen 3 (8B) model with the in-context CoT generated by Qwen 3 (14B). Left: Qwen 3 (8B). Right: Qwen 3 (14B).
    should depend on procedural content. This representation is designed to capture not only topical similarity but also the logical structures and operations expressed in the CoT rationale. For efficient and s…
Figure 9: Performance with original (ori), similarity (sim) and dissimilar (dis) sets averaged across three non-reasoning LLMs. The area between the two sets is filled with colors, indicating the relative performance at each point.
    Panels: BANKING77 (Sim > Ori/Dis), DetectiveQA (Ori/Dis > Sim), geometry (O…). Legend: Original, Most Similar, Most Dissimilar.
Figure 10: Performance with original (ori), similarity (sim) and dissimilar (dis) sets averaged across two reasoning LLMs. The area between the two sets is filled with colors, indicating the relative performance at each point.
    B. Statistical Robustness on a New ICL Subset We compute the mean and standard deviation across five random demonstration-ordering seeds, and repeat the analysis on a newly sampled ICL subset. …
Figure 11: Prompt for WSC task
    D. CDS: Details and Implementation D.1. Model Studies We design our experiments to isolate the effect of demonstration ordering in many-shot CoT-ICL. To this end, we focus on reasoning-oriented LLMs that exhibit a positive scaling trend with more demonstrations, since such models demonstrate in-context learning capacity and should benefit from improved ordering. We also control for co…
Figure 14: Prompt for BANKING77 task
    Given a question, predict the label of the question. You can only make predictions from the following categories: {LIST_OF_CATEGORIES}
    Please predict the intent category of the FINAL utterance with the provided demonstration example queries as follows:
    utterance: {question_1}
    intent category: {label_1}
    ...
    utterance: {question_n}
    intent category: {label_n}
    Now predict the intent…
Figure 15: Prompt for NLU task
    The prompt for inference is presented in …
Figure 16: Prompt for GSM8K task
    Write a response that appropriately completes the request and wrap the final answer inside \boxed{}.
    Problem: {question_1}
    Solution: {CoT_with_answer_1}
    ...
    Problem: {question_n}
    Solution: {CoT_with_answer_n}
    ### Problem: {question_t}
    ### Solution: Let’s think step by step
Figure 17: Unified prompt for MATH task
    Below is an instruction that describes a task.\n Select the correct option from A/B/C/D. Answer with ’The answer is {A/B/C/D}.’ in the end of your response.\n\n"
    Question: {question_1}
    Context: {context_1}
    Options:
    A. {option_1_1}
    B. {option_1_2}
    C. {option_1_3}
    D. {option_1_4}
    Answer: {derivation_1} The answer is {answer_1}.
    ...
    Question: {question_n}
    Context: {context_n}
    Op…
Figure 18: Prompt for DetectiveQA task
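
The Problem/Solution template quoted in the figure excerpts above (the one that wraps the final answer in \boxed{}) can be assembled programmatically. The sketch below is illustrative only: the blank lines between blocks are an assumption, since the extracted text does not preserve line breaks, and `build_prompt` and the sample demonstrations are hypothetical.

```python
INSTRUCTION = (
    "Write a response that appropriately completes the request "
    "and wrap the final answer inside \\boxed{}."
)

def build_prompt(demos, target_question):
    """Assemble a many-shot CoT prompt from (question, CoT-with-answer) pairs.

    Block separators (blank lines) are assumed, not taken from the paper.
    """
    parts = [INSTRUCTION]
    for question, cot_with_answer in demos:
        parts.append(f"Problem: {question}\nSolution: {cot_with_answer}")
    parts.append(
        f"### Problem: {target_question}\n"
        "### Solution: Let's think step by step"
    )
    return "\n\n".join(parts)

# Hypothetical demonstrations, listed in the order they should appear.
demos = [
    ("What is 2 + 3?", "2 + 3 = 5. The answer is \\boxed{5}."),
    ("What is 4 * 6?", "4 * 6 = 24. The answer is \\boxed{24}."),
]
prompt = build_prompt(demos, "What is 7 - 2?")
print(prompt)
```

Because the demonstrations are emitted in list order, any ordering method only has to permute `demos` before the prompt is built.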
Original abstract

While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In this work, we study many-shot ICL on reasoning tasks, with a particular focus on many-shot chain-of-thought in-context learning (CoT-ICL). Across non-reasoning and reasoning tasks and across non-reasoning and reasoning-oriented LLMs, we identify several distinctive properties of many-shot CoT-ICL. We further interpret these findings by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on a math task with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript studies many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning tasks. It reports three main empirical patterns: (i) a setting-dependent scaling effect in which adding more CoT demonstrations is unstable for non-reasoning LLMs but beneficial for reasoning-oriented models; (ii) failure of semantic similarity retrieval on reasoning tasks because it does not capture procedural compatibility; and (iii) an order-scaling effect in which performance variance grows with the number of demonstrations. The authors interpret these behaviors as evidence that many-shot CoT-ICL functions as in-context test-time learning rather than scaled pattern matching, propose two guiding principles (demonstrations should be easy for the model to understand and should be ordered for smooth conceptual progression), and introduce Curvilinear Demonstration Selection (CDS) as a concrete ordering heuristic that yields up to a 5.42 percentage-point gain on geometry tasks with 64 demonstrations.

Significance. If the central empirical patterns and the attribution of gains to ordering hold, the work supplies a useful conceptual reframing of long-context ICL as structured curriculum-style learning and a practical, low-overhead method (CDS) that improves reasoning performance. The cross-model and cross-task consistency of the reported scaling behaviors is a positive feature that could inform future prompt-engineering practice.

major comments (3)
  1. [CDS description and associated experiments] The experiments do not appear to ablate ordering while holding the exact demonstration set fixed. Consequently it remains unclear whether the reported 5.42 pp gain on geometry with 64 shots is produced by the curvilinear sequence itself or by the upstream selection of high-quality or procedurally compatible examples. This distinction is load-bearing for the claim that principle (ii) (smooth conceptual progression) explains the order-scaling effect and for the test-time-learning interpretation.
  2. [Experimental results and methods] The results sections provide no variance measures (standard deviations or confidence intervals across random seeds or runs), no exact operationalization of the CDS curvilinear ordering procedure, and no explicit controls for prompt-length or token-budget confounds when scaling from few-shot to 64-shot regimes. These omissions weaken the support for both the scaling-effect claims and the performance gains attributed to CDS.
  3. [Discussion and interpretation] The interpretation of many-shot CoT-ICL as in-context test-time learning is offered as a post-hoc reframing of the observed behaviors. No direct diagnostic experiments (e.g., incremental probing of concept acquisition or comparison against non-ordered but equally informative demonstration sets) are reported that would distinguish this account from alternative explanations such as improved coverage or reduced example interference.
minor comments (2)
  1. [Methods] Define 'reasoning-oriented LLMs' versus 'non-reasoning LLMs' more explicitly in the methods section, including the criteria used for classification.
  2. [Figures and tables] Add error bars or confidence intervals to all performance plots and tables that report scaling curves.
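
The variance reporting requested in major comment 2 and minor comment 2 can be sketched as follows: summarize per-seed accuracies as a mean, a sample standard deviation, and an interval. This is a minimal sketch, assuming independent runs; `summarize_runs` is a hypothetical helper, the accuracies are made-up illustrative numbers rather than results from the paper, and the critical value must be chosen by the caller (2.776 is the two-sided 95% Student-t value for five runs, i.e. four degrees of freedom).

```python
import math
import statistics

def summarize_runs(accuracies, critical=1.96):
    """Return (mean, sample std, (low, high)) for a list of run accuracies.

    `critical` is the multiplier on the standard error; the default 1.96 is
    the normal approximation and is too narrow for very few runs.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (n - 1)
    half_width = critical * std / math.sqrt(len(accuracies))
    return mean, std, (mean - half_width, mean + half_width)

# Illustrative accuracies from five hypothetical seeds (not paper data).
runs = [70.0, 72.0, 71.0, 69.0, 73.0]
mean, std, (low, high) = summarize_runs(runs, critical=2.776)
print(f"{mean:.2f} +/- {std:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Plotting `low` and `high` as error bars on each scaling curve would show whether the gap between two settings is larger than the run-to-run spread.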

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us strengthen the empirical rigor and clarity of the manuscript. We have revised the paper to address the concerns about ablations, variance reporting, method operationalization, and controls. Below we respond point-by-point to each major comment.

Point-by-point responses
  1. Referee: [CDS description and associated experiments] The experiments do not appear to ablate ordering while holding the exact demonstration set fixed. Consequently it remains unclear whether the reported 5.42 pp gain on geometry with 64 shots is produced by the curvilinear sequence itself or by the upstream selection of high-quality or procedurally compatible examples. This distinction is load-bearing for the claim that principle (ii) (smooth conceptual progression) explains the order-scaling effect and for the test-time-learning interpretation.

    Authors: We agree that isolating the contribution of ordering requires holding the demonstration set fixed. In the revised manuscript we have added a controlled ablation on the geometry task with 64 shots: we fix the exact set of demonstrations selected by CDS and compare performance under (a) the original curvilinear order versus (b) a random permutation of the same set. The curvilinear order yields a statistically significant improvement over random order on the fixed set, supporting that the ordering itself drives part of the gain and bolstering the smooth-progression principle. We have also clarified in Section 4.2 that CDS first selects candidates by procedural compatibility heuristics and then orders them; the new ablation separates these stages. revision: yes

  2. Referee: [Experimental results and methods] The results sections provide no variance measures (standard deviations or confidence intervals across random seeds or runs), no exact operationalization of the CDS curvilinear ordering procedure, and no explicit controls for prompt-length or token-budget confounds when scaling from few-shot to 64-shot regimes. These omissions weaken the support for both the scaling-effect claims and the performance gains attributed to CDS.

    Authors: We acknowledge these omissions. The revised version now reports standard deviations across five independent runs (different random seeds for ordering and model sampling) for all scaling curves and CDS results. We have added a precise algorithmic description of CDS, including pseudocode for the curvilinear traversal, in the methods section. To address token-budget confounds, we include a controlled experiment that truncates all prompts to the same maximum token length when scaling from 4 to 64 shots; the reported scaling trends and CDS gains remain consistent under this control. revision: yes

  3. Referee: [Discussion and interpretation] The interpretation of many-shot CoT-ICL as in-context test-time learning is offered as a post-hoc reframing of the observed behaviors. No direct diagnostic experiments (e.g., incremental probing of concept acquisition or comparison against non-ordered but equally informative demonstration sets) are reported that would distinguish this account from alternative explanations such as improved coverage or reduced example interference.

    Authors: The test-time-learning framing is indeed interpretive and derived from the combination of the three observed patterns rather than from dedicated diagnostic probes. We have expanded the discussion to explicitly contrast this account with alternatives (improved coverage, reduced interference) and to acknowledge the absence of incremental probing experiments as a limitation. The new fixed-set ordering ablation helps differentiate ordering effects from pure coverage, but we agree that stronger causal evidence would require additional experiments (e.g., step-wise concept probes) that are beyond the scope of the current revision. revision: partial
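
The token-budget control mentioned in the response to comment 2 could be approximated as below. The rebuttal does not say how prompts are truncated, so this is one possible reading: keep leading demonstrations until a shared token budget would be exceeded. `fit_to_budget` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def whitespace_tokens(text):
    """Crude token count (assumption): number of whitespace-separated pieces."""
    return len(text.split())

def fit_to_budget(demos, max_tokens, count_tokens=whitespace_tokens):
    """Keep demonstrations in order until adding one would exceed the budget."""
    kept, used = [], 0
    for demo in demos:
        cost = count_tokens(demo)
        if used + cost > max_tokens:
            break  # stop at the first demonstration that does not fit
        kept.append(demo)
        used += cost
    return kept, used

# Placeholder demonstrations costing 3, 2, 4 and 1 tokens respectively.
demos = ["a b c", "d e", "f g h i", "j"]
kept, used = fit_to_budget(demos, max_tokens=6)
print(kept, used)
```

Applying the same `max_tokens` to every shot count is what separates the effect of adding demonstrations from the effect of a longer prompt.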

standing simulated objections not resolved
  • Direct diagnostic experiments (incremental probing of concept acquisition) to causally distinguish the test-time learning interpretation from alternatives such as coverage or interference.

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical observations and heuristic proposal

Full rationale

The paper reports empirical scaling effects for many-shot CoT-ICL, interprets them as in-context test-time learning, states two principles, and introduces CDS as a simple ordering heuristic guided by those principles. No derivation reduces by construction to its inputs: there are no equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz smuggled via prior work. Performance gains are presented as experimental results on specific tasks and models rather than tautological outcomes of the framing itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can perform test-time learning from ordered demonstrations; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: LLMs perform in-context test-time learning that improves when demonstrations are ordered for smooth conceptual progression
    Invoked to explain why standard many-shot rules fail and to motivate the CDS method.

pith-pipeline@v0.9.0 · 5614 in / 1277 out tokens · 55230 ms · 2026-05-14T19:12:41.184249+00:00 · methodology
