pith. machine review for the scientific record.

arxiv: 2605.13511 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.AI

Recognition: unknown

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords many-shot in-context learning · chain-of-thought prompting · demonstration ordering · test-time learning · reasoning tasks · curvilinear selection · LLM scaling

The pith

Many-shot chain-of-thought in-context learning behaves as test-time learning when demonstrations are ordered for smooth conceptual progression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that many-shot CoT-ICL does not follow the same scaling rules as non-reasoning ICL. Increasing the number of chain-of-thought demonstrations produces unstable results on non-reasoning models and tasks but benefits reasoning-oriented models. Semantic similarity retrieval also fails for reasoning because it does not capture procedural compatibility. The authors interpret these patterns by treating the long prompt as a curriculum for in-context test-time learning rather than mere pattern matching. From this view they derive two principles for demonstration choice and introduce Curvilinear Demonstration Selection, an ordering technique that improves accuracy by up to 5.42 points on geometry problems with 64 examples.

Core claim

Across both reasoning and non-reasoning LLMs and tasks, many-shot CoT-ICL exhibits a setting-dependent scaling effect, fails under similarity-based retrieval because semantic similarity poorly predicts CoT compatibility, and displays growing performance variance with demonstration order. Viewing the setup as in-context test-time learning rather than scaled pattern matching yields two principles: demonstrations should be easy for the target model to understand and should be ordered to support smooth conceptual progression. Guided by these principles, Curvilinear Demonstration Selection produces consistent gains, reframing the long context window as a structured curriculum.

What carries the argument

Curvilinear Demonstration Selection (CDS), an ordering method that arranges demonstrations to follow a smooth conceptual progression so the model can perform in-context test-time learning.
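The paper does not spell out the CDS procedure in this summary, but the smooth-progression idea can be sketched in miniature: given a scalar difficulty estimate per demonstration (e.g. rationale length or target-model loss; both hypothetical stand-ins here), order examples so adjacent difficulty gaps stay small. The actual CDS algorithm may differ.

```python
# Hypothetical sketch of curriculum-style demonstration ordering.
# The "difficulty" function is an assumed stand-in for whatever
# easiness estimate the target model supplies; CDS proper is not
# specified in this review and may use a different criterion.

def order_for_smooth_progression(demos, difficulty):
    """Return demos sorted easy-to-hard so adjacent examples
    differ as little as possible in estimated difficulty."""
    return sorted(demos, key=difficulty)

def max_difficulty_jump(demos, difficulty):
    """Largest difficulty gap between adjacent demonstrations;
    a smooth progression keeps this small."""
    scores = [difficulty(d) for d in demos]
    return max(abs(b - a) for a, b in zip(scores, scores[1:]))

# Toy demonstrations with made-up difficulty scores.
demos = ["short proof", "two-step angle chase", "multi-step construction"]
difficulty = {"short proof": 1.0, "two-step angle chase": 2.0,
              "multi-step construction": 5.0}
ordered = order_for_smooth_progression(demos, difficulty.get)
```

Ascending sort is only one realization; any ordering that keeps adjacent difficulty gaps small would satisfy the smooth-progression principle as stated.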

If this is right

  • Reasoning-oriented models benefit from many more CoT demonstrations once ordering respects conceptual progression.
  • Similarity-based retrieval must be replaced by procedural-compatibility measures for reasoning tasks.
  • Performance variance grows with demonstration count unless order supports smooth progression.
  • Long context windows function as curricula rather than simple retrieval buffers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same ordering principle could be applied to non-reasoning tasks by first identifying what the model finds easy to parse.
  • Models might internally assess demonstration difficulty and reorder prompts dynamically at inference time.
  • Curriculum design for in-context learning could be tested on tasks beyond geometry by measuring step-by-step mastery.
  • If test-time learning is the mechanism, training objectives that reward smooth progression in synthetic data might amplify the effect.

Load-bearing premise

The observed gains arise because ordered demonstrations enable the model to perform test-time learning rather than because of raw prompt length or model-specific artifacts.

What would settle it

Randomly ordering the same 64 demonstrations or replacing the curvilinear order with any non-progressive sequence should produce comparable accuracy gains if the test-time learning account is incorrect.
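The settling experiment reads as a fixed-set ordering ablation, which can be sketched as follows. Everything here is hypothetical scaffolding: `evaluate` is a toy scorer standing in for a real model run, and the curated order is approximated by an ascending sort.

```python
import random

# Hypothetical ablation harness: hold the demonstration set fixed and
# compare one curated order against random permutations of the same set.
# `evaluate` is a toy proxy (it rewards small adjacent difficulty jumps)
# standing in for actually prompting a model and measuring accuracy.

def evaluate(order):
    jumps = [abs(b - a) for a, b in zip(order, order[1:])]
    return 1.0 / (1.0 + sum(jumps))  # toy stand-in for accuracy

def ordering_ablation(demo_difficulties, n_perms=100, seed=0):
    """Score a curated (here: sorted) order and the mean score of
    random permutations of the exact same demonstration set."""
    rng = random.Random(seed)
    curated = sorted(demo_difficulties)  # stand-in for the curvilinear order
    curated_score = evaluate(curated)
    random_scores = []
    for _ in range(n_perms):
        perm = demo_difficulties[:]
        rng.shuffle(perm)
        random_scores.append(evaluate(perm))
    return curated_score, sum(random_scores) / n_perms
```

If random permutations matched the curated order under a real `evaluate`, the ordering-based explanation would lose its support; a persistent gap is what the test-time-learning account predicts.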

Figures

Figures reproduced from arXiv: 2605.13511 by Dit-Yan Yeung, Lemao Liu, Mo Yu, Tsz Ting Chung.

Figure 1
Figure 1: Reframing of CoT-ICL as in-context test-time learning. view at source ↗
Figure 3
Figure 3: Scaling disparity between model types on math reasoning tasks. Left: Llama 3.3 (non-reasoning LLM) shows negative gains. Right: QwQ (32B) and R1 (685B) (reasoning LLMs) show clear positive scaling. view at source ↗
Figure 2
Figure 2: Scaling disparity between task types. Performance (normalized accuracy) of non-reasoning LLMs on classification tasks (warm colors) versus reasoning tasks (cool colors). view at source ↗
Figure 4
Figure 4: Positive scaling of reasoning LLMs. The Qwen3 family (reasoning LLMs) demonstrates consistent performance improvements with more demonstrations on math reasoning tasks. Left: Qwen3 (8B). Right: Qwen3 (14B). view at source ↗
Figure 6
Figure 6: Standard deviation of performance across five random demonstration orders on classification tasks (warm colors) versus reasoning tasks (cool colors), where nt denotes number theory. Results shown for Qwen2.5 (14B) (non-reasoning) and Qwen3 (14B) (reasoning). view at source ↗
Figure 7
Figure 7: Performance of two sets of self-generated in-context CoT: the set filtered to only correct answers (cr) and the set filtered to only wrong answers (wr). crqwen14 denotes prompting the LLaMA model with in-context CoT generated by Qwen 2.5 (14B). Left: Llama 3.1. Right: Qwen 2.5 (14B). view at source ↗
Figure 8
Figure 8: Performance of the first set of self-generated in-context CoT. firstqwen3(14b) denotes prompting the Qwen 3 (8B) model with in-context CoT generated by Qwen 3 (14B). Left: Qwen 3 (8B). Right: Qwen 3 (14B). view at source ↗
Figure 9
Figure 9: Performance with the original (ori), most-similar (sim), and most-dissimilar (dis) sets averaged across three non-reasoning LLMs. The area between the two sets is filled with color, indicating relative performance at each point. view at source ↗
Figure 10
Figure 10: Performance with the original (ori), most-similar (sim), and most-dissimilar (dis) sets averaged across two reasoning LLMs. The area between the two sets is filled with color, indicating relative performance at each point. view at source ↗
Figure 11
Figure 11: Prompt for WSC task. view at source ↗
Figure 14
Figure 14: Prompt for BANKING77 task. view at source ↗
Figure 15
Figure 15: Prompt for NLU task. view at source ↗
Figure 16
Figure 16: Prompt for GSM8K task. view at source ↗
Figure 17
Figure 17: Unified prompt for MATH task. view at source ↗
Figure 18
Figure 18: Prompt for DetectiveQA task. view at source ↗
read the original abstract

In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not transfer. Across non-reasoning and reasoning-oriented LLMs and across non-reasoning and reasoning tasks, we find: (i) a setting-dependent scaling effect, where increasing the number of CoT demonstrations is unstable for non-reasoning LLMs and benefits mainly reasoning-oriented LLMs; (ii) similarity-based retrieval helps on non-reasoning tasks but fails on reasoning, since semantic similarity poorly predicts procedural (i.e., CoT) compatibility; and (iii) an order-scaling effect, where performance variance grows with more CoT demonstrations. We interpret these behaviors by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggests two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on geometry with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript studies many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning tasks. It reports three main empirical patterns: (i) a setting-dependent scaling effect in which adding more CoT demonstrations is unstable for non-reasoning LLMs but beneficial for reasoning-oriented models; (ii) failure of semantic similarity retrieval on reasoning tasks because it does not capture procedural compatibility; and (iii) an order-scaling effect in which performance variance grows with the number of demonstrations. The authors interpret these behaviors as evidence that many-shot CoT-ICL functions as in-context test-time learning rather than scaled pattern matching, propose two guiding principles (demonstrations should be easy for the model to understand and should be ordered for smooth conceptual progression), and introduce Curvilinear Demonstration Selection (CDS) as a concrete ordering heuristic that yields up to a 5.42 percentage-point gain on geometry tasks with 64 demonstrations.

Significance. If the central empirical patterns and the attribution of gains to ordering hold, the work supplies a useful conceptual reframing of long-context ICL as structured curriculum-style learning and a practical, low-overhead method (CDS) that improves reasoning performance. The cross-model and cross-task consistency of the reported scaling behaviors is a positive feature that could inform future prompt-engineering practice.

major comments (3)
  1. [CDS description and associated experiments] The experiments do not appear to ablate ordering while holding the exact demonstration set fixed. Consequently it remains unclear whether the reported 5.42 pp gain on geometry with 64 shots is produced by the curvilinear sequence itself or by the upstream selection of high-quality or procedurally compatible examples. This distinction is load-bearing for the claim that principle (ii) (smooth conceptual progression) explains the order-scaling effect and for the test-time-learning interpretation.
  2. [Experimental results and methods] The results sections provide no variance measures (standard deviations or confidence intervals across random seeds or runs), no exact operationalization of the CDS curvilinear ordering procedure, and no explicit controls for prompt-length or token-budget confounds when scaling from few-shot to 64-shot regimes. These omissions weaken the support for both the scaling-effect claims and the performance gains attributed to CDS.
  3. [Discussion and interpretation] The interpretation of many-shot CoT-ICL as in-context test-time learning is offered as a post-hoc reframing of the observed behaviors. No direct diagnostic experiments (e.g., incremental probing of concept acquisition or comparison against non-ordered but equally informative demonstration sets) are reported that would distinguish this account from alternative explanations such as improved coverage or reduced example interference.
minor comments (2)
  1. [Methods] Define 'reasoning-oriented LLMs' versus 'non-reasoning LLMs' more explicitly in the methods section, including the criteria used for classification.
  2. [Figures and tables] Add error bars or confidence intervals to all performance plots and tables that report scaling curves.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us strengthen the empirical rigor and clarity of the manuscript. We have revised the paper to address the concerns about ablations, variance reporting, method operationalization, and controls. Below we respond point-by-point to each major comment.

read point-by-point responses
  1. Referee: [CDS description and associated experiments] The experiments do not appear to ablate ordering while holding the exact demonstration set fixed. Consequently it remains unclear whether the reported 5.42 pp gain on geometry with 64 shots is produced by the curvilinear sequence itself or by the upstream selection of high-quality or procedurally compatible examples. This distinction is load-bearing for the claim that principle (ii) (smooth conceptual progression) explains the order-scaling effect and for the test-time-learning interpretation.

    Authors: We agree that isolating the contribution of ordering requires holding the demonstration set fixed. In the revised manuscript we have added a controlled ablation on the geometry task with 64 shots: we fix the exact set of demonstrations selected by CDS and compare performance under (a) the original curvilinear order versus (b) a random permutation of the same set. The curvilinear order yields a statistically significant improvement over random order on the fixed set, supporting that the ordering itself drives part of the gain and bolstering the smooth-progression principle. We have also clarified in Section 4.2 that CDS first selects candidates by procedural compatibility heuristics and then orders them; the new ablation separates these stages. revision: yes

  2. Referee: [Experimental results and methods] The results sections provide no variance measures (standard deviations or confidence intervals across random seeds or runs), no exact operationalization of the CDS curvilinear ordering procedure, and no explicit controls for prompt-length or token-budget confounds when scaling from few-shot to 64-shot regimes. These omissions weaken the support for both the scaling-effect claims and the performance gains attributed to CDS.

    Authors: We acknowledge these omissions. The revised version now reports standard deviations across five independent runs (different random seeds for ordering and model sampling) for all scaling curves and CDS results. We have added a precise algorithmic description of CDS, including pseudocode for the curvilinear traversal, in the methods section. To address token-budget confounds, we include a controlled experiment that truncates all prompts to the same maximum token length when scaling from 4 to 64 shots; the reported scaling trends and CDS gains remain consistent under this control. revision: yes

  3. Referee: [Discussion and interpretation] The interpretation of many-shot CoT-ICL as in-context test-time learning is offered as a post-hoc reframing of the observed behaviors. No direct diagnostic experiments (e.g., incremental probing of concept acquisition or comparison against non-ordered but equally informative demonstration sets) are reported that would distinguish this account from alternative explanations such as improved coverage or reduced example interference.

    Authors: The test-time-learning framing is indeed interpretive and derived from the combination of the three observed patterns rather than from dedicated diagnostic probes. We have expanded the discussion to explicitly contrast this account with alternatives (improved coverage, reduced interference) and to acknowledge the absence of incremental probing experiments as a limitation. The new fixed-set ordering ablation helps differentiate ordering effects from pure coverage, but we agree that stronger causal evidence would require additional experiments (e.g., step-wise concept probes) that are beyond the scope of the current revision. revision: partial

standing simulated objections not resolved
  • Direct diagnostic experiments (incremental probing of concept acquisition) to causally distinguish the test-time learning interpretation from alternatives such as coverage or interference.
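The variance reporting promised in the rebuttal's second response (mean and standard deviation across five independent runs) amounts to a simple summary per configuration. A minimal sketch, with placeholder numbers rather than the paper's actual results:

```python
from statistics import mean, stdev

# Sketch of per-configuration seed-variance reporting: several
# independent runs (different demonstration-order seeds) summarized
# as mean and sample standard deviation. The accuracies below are
# hypothetical placeholders, not results from the paper.

def summarize_runs(accuracies):
    """Return (mean, sample std) over a list of run accuracies."""
    return mean(accuracies), stdev(accuracies)

runs = [41.2, 43.0, 42.1, 40.8, 42.9]  # hypothetical five-seed results
mu, sigma = summarize_runs(runs)
```

Reporting sigma alongside each scaling point is what would let a reader judge whether a 5.42 pp gain clears the order-induced noise the order-scaling effect itself predicts.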

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical observations and a heuristic proposal

full rationale

The paper reports empirical scaling effects for many-shot CoT-ICL, interprets them as in-context test-time learning, states two principles, and introduces CDS as a simple ordering heuristic guided by those principles. No derivation reduces by construction to its inputs: there are no equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz smuggled via prior work. Performance gains are presented as experimental results on specific tasks and models rather than tautological outcomes of the framing itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can perform test-time learning from ordered demonstrations; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption LLMs perform in-context test-time learning that improves when demonstrations are ordered for smooth conceptual progression
    Invoked to explain why standard many-shot rules fail and to motivate the CDS method.

pith-pipeline@v0.9.0 · 5614 in / 1277 out tokens · 55230 ms · 2026-05-14T19:12:41.184249+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1]

    Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability

    Chung, Tsz Ting and Cui, Leyang and Liu, Lemao and Huang, Xinting and Shi, Shuming and Yeung, Dit-Yan. Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.646

  2. [2]

    Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning

    Crosbie, Joy and Shutova, Ekaterina. Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.283

  3. [3]

    In-context Learning and Induction Heads

    Olsson, Catherine et al. In-context Learning and Induction Heads. Transformer Circuits Thread. 2022.

  4. [4]

    The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

    Yu, Mo and Liu, Lemao and Wu, Junjie and Chung, Tsz Ting and Zhang, Shunchi and Li, Jiangnan and Yeung, Dit-Yan and Zhou, Jie. The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human...

  5. [5]

    DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

    Chung, Tsz Ting and Liu, Lemao and Yu, Mo and Yeung, Dit-Yan. DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.47

  6. [6]

    arXiv preprint arXiv:2009.07896 (2020)

    Captum: A unified and generic model interpretability library for PyTorch , year =. arXiv , author =:2009.07896 , primaryclass =

  7. [7]

    Measuring Mathematical Problem Solving With the MATH Dataset , year =

    Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt , journal =. Measuring Mathematical Problem Solving With the MATH Dataset , year =

  8. [8]

    Training Verifiers to Solve Math Word Problems , url =

    Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , journal =. Training Verifiers to Solve Math Word Problems , url =

  9. [9]

    Increasing Naturalness and Flexibility in Spoken Dialogue Interaction , doi =

    Benchmarking Natural Language Understanding Services for Building Conversational Agents , url =. Increasing Naturalness and Flexibility in Spoken Dialogue Interaction , doi =

  10. [10]

    Efficient Intent Detection with Dual Sentence Encoders , url =

    Casanueva, I. Efficient Intent Detection with Dual Sentence Encoders , url =. Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI , doi =

  11. [11]

    Proceedings of the first international conference on Human language technology research , year=

    Toward semantics-based answer pinpointing , author=. Proceedings of the first international conference on Human language technology research , year=

  12. [12]

    Advances in neural information processing systems , volume=

    Superglue: A stickier benchmark for general-purpose language understanding systems , author=. Advances in neural information processing systems , volume=

  13. [13]

    Qwen2.5 Technical Report , url =

    Qwen and : and An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu an...

  14. [14]

    In-context learning with long-context models: An in-depth exploration , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  15. [15]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , publisher =. The Theory of Parsing, Translation and Compiling , volume =

  16. [16]

    Publications Manual , year =

  17. [17]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , doi =. Alternation , volume =. Journal of the Association for Computing Machinery , number =

  18. [18]

    Tetreault , journal =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , journal =. Yara Parser:

  19. [19]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , volume =

    Ando, Rie Kubota and Zhang, Tong , issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , volume =. Journal of Machine Learning Research , numpages =

  20. [20]

    Advances in Neural Information Processing Systems , volume=

    Many-shot in-context learning , author=. Advances in Neural Information Processing Systems , volume=

  21. [21]

    Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback , url =

    Yafu Li and Xuyang Hu and Xiaoye Qu and Linjie Li and Yu Cheng , journal =. Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback , url =

  22. [22]

    Dr.ICL: Demonstration-Retrieved In-context Learning , url =

    Man Luo and Xin Xu and Zhuyun Dai and Panupong Pasupat and Mehran Kazemi and Chitta Baral and Vaiva Imbrasaite and Vincent Y Zhao , journal =. Dr.ICL: Demonstration-Retrieved In-context Learning , url =

  23. [23]

    What Makes Good In-Context Examples for

    Liu, Jiachang and Shen, Dinghan and Zhang, Yizhe and Dolan, Bill and Carin, Lawrence and Chen, Weizhu , booktitle =. What Makes Good In-Context Examples for. doi:10.18653/v1/2022.deelio-1.10 , editor =

  24. [24]

    Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering , url =

    Wu, Zhiyong and Wang, Yaoxiang and Ye, Jiacheng and Kong, Lingpeng , booktitle =. Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering , url =. doi:10.18653/v1/2023.acl-long.79 , editor =

  25. [25]

    Exploring the Role of Diversity in Example Selection for In-Context Learning , url =

    Janak Kapuriya and Manit Kaushik and Debasis Ganguly and Sumit Bhatia , journal =. Exploring the Role of Diversity in Example Selection for In-Context Learning , url =

  26. [26]

    Cohen , issn =

    Wenhu Chen and Xueguang Ma and Xinyi Wang and William W. Cohen , issn =. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks , url =. Transactions on Machine Learning Research , note =

  27. [27]

    Advances in neural information processing systems , volume=

    Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=

  28. [28]

    arXiv preprint arXiv:2501.04519 , year=

    rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking , year =. arXiv , author =:2501.04519 , primaryclass =

  29. [29]

    LongRoPE: Extending

    Yiran Ding and Li Lyna Zhang and Chengruidong Zhang and Yuanyuan Xu and Ning Shang and Jiahang Xu and Fan Yang and Mao Yang , bibsource =. LongRoPE: Extending. Forty-first International Conference on Machine Learning,

  30. [30]

    Han, Chi and Wang, Qifan and Peng, Hao and Xiong, Wenhan and Chen, Yu and Ji, Heng and Wang, Sinong , booktitle =

  31. [31]

    YaRN: Efficient Context Window Extension of Large Language Models , url =

    Bowen Peng and Jeffrey Quesnelle and Honglu Fan and Enrico Shippole , bibsource =. YaRN: Efficient Context Window Extension of Large Language Models , url =. The Twelfth International Conference on Learning Representations,

  32. [32]

    Long-context LLMs Struggle with Long In-context Learning , url =

    Tianle Li and Ge Zhang and Quy Duc Do and Xiang Yue and Wenhu Chen , journal =. Long-context LLMs Struggle with Long In-context Learning , url =

  33. [33]

    Revisiting In-Context Learning with Long Context Language Models , url =

    Jinheon Baek and Sun Jae Lee and Prakhar Gupta and Geunseob Oh and Siddharth Dalmia and Prateek Kolhar , journal =. Revisiting In-Context Learning with Long Context Language Models , url =

  34. [34] Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems.

  35. [35] Transformers learn in-context by gradient descent. International Conference on Machine Learning, 2023.

  36. [36] In-context learning and gradient descent revisited. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).

  37. [37] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? EMNLP, 2022. doi:10.18653/v1/2022.emnlp-main.759

  38. [38] Taylor Sorensen, Joshua Robinson, Christopher Rytting, Alexander Shaw, Kyle Rogers, Alexia Delorey, Mahmoud Khalil, Nancy Fulda, David Wingate. An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels. ACL, 2022. doi:10.18653/v1/2022.acl-long.60

  39. [39] Shengnan An, Zeqi Lin, Qiang Fu, Bei Chen, Nanning Zheng, Jian-Guang Lou, Dongmei Zhang. How Do In-Context Examples Affect Compositional Generalization? ACL, 2023. doi:10.18653/v1/2023.acl-long.618

  40. [40] Costas Mavromatis, Balasubramaniam Srinivasan, Zhengyuan Shen, Jiani Zhang, Huzefa Rangwala, Christos Faloutsos, George Karypis. Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection.

  41. [41] Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.

  42. [42] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe. Let's Verify Step by Step. The Twelfth International Conference on Learning Representations, 2024.

  43. [43] Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Raghavi Chandu, Chandra Bhagavatula, Yejin Choi. The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning. The Twelfth International Conference on Learning Representations, 2024.

  44. [44] Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola. Automatic Chain of Thought Prompting in Large Language Models. The Eleventh International Conference on Learning Representations, 2023.

  45. [45] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.

  46. [46] Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou. PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts.

  47. [47] Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, Jun Wang. Deep Research Agents: A Systematic Examination And Roadmap.

  48. [48] Principal components analysis (PCA). Computers & Geosciences, 1993.

  49. [49] Leland McInnes, John Healy, James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction.

  50. [50] Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.

  51. [51] Encyclopedia of Infant and Early Childhood Development. 2020.

  52. [52] Georges A. Croes. A method for solving traveling-salesman problems.

  53. [53] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.

  54. [54] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

  55. [55] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, et al. Qwen3 Technical Report.

  56. [56] Zhe Xu, Jiasheng Ye, Xiaoran Liu, Xiangyang Liu, Tianxiang Sun, Zhigeng Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu. DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels.

  57. [57] Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

  58. [58] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention.