ACIL: Auto Chain of Thoughts for In-Context Learning

Rui Chu

arxiv: 2605.17088 · v1 · pith:WBF23SVBnew · submitted 2026-05-16 · 💻 cs.CL

Pith reviewed 2026-05-20 15:21 UTC · model grok-4.3

keywords Auto-CoT · Chain-of-Thought · In-Context Learning · Large Language Models · Reasoning Tasks · Prompt Construction · Demonstration Selection

The pith

Auto-CoT automatically generates and selects reasoning chains to strengthen in-context learning demonstrations for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models gain from Chain-of-Thought reasoning on complex tasks, yet standard in-context learning supplies only input-output pairs that omit the intermediate steps. This paper introduces the Auto-CoT framework to address the gap by generating reasoning chains for examples, embedding those structured explanations into the prompt, and filtering out irrelevant or low-quality items via systematic selection. A reader would care because the change lets models follow clearer multi-step logic without any parameter updates or manual example writing. The approach augments the context with explicit guidance and shows accuracy gains across reasoning benchmarks by steering the model toward more reliable paths.

Core claim

The Auto-CoT framework constructs reasoning-enhanced demonstrations by automatically generating chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process, thereby providing explicit intermediate reasoning guidance that improves prediction accuracy on complex reasoning tasks.
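The three stages named in the core claim (generate, augment, select) can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `Demo`, `augment`, `select`, `build_prompt`, and `toy_generator` are hypothetical names, the generator is a canned stand-in for an LLM call, and the filter (keep a demonstration only when the final value in its chain equals the known answer) is an assumed criterion, since the abstract does not say how selection is scored.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Demo:
    question: str
    answer: str
    chain: Optional[str] = None  # reasoning text, filled in by the generator


def augment(demos: List[Demo], generate_chain: Callable[[str], str]) -> List[Demo]:
    """Attach a generated reasoning chain to every input-output pair."""
    return [Demo(d.question, d.answer, generate_chain(d.question)) for d in demos]


def select(demos: List[Demo]) -> List[Demo]:
    """Assumed filter: keep a demo only if the value after the last '='
    in its chain equals the known answer."""
    kept = []
    for d in demos:
        final_value = (d.chain or "").rsplit("=", 1)[-1].strip()
        if final_value == d.answer:
            kept.append(d)
    return kept


def build_prompt(demos: List[Demo], query: str) -> str:
    """Place question, reasoning, and answer for each demo before the query."""
    blocks = [f"Q: {d.question}\nReasoning: {d.chain}\nA: {d.answer}" for d in demos]
    blocks.append(f"Q: {query}\nReasoning:")
    return "\n\n".join(blocks)


def toy_generator(question: str) -> str:
    """Canned stand-in for an LLM call (hypothetical)."""
    canned = {
        "2 + 3 * 4": "3 * 4 = 12, then 2 + 12 = 14",
        "10 - 2 * 3": "10 - 2 = 8, then 8 * 3 = 24",  # wrong operator precedence
    }
    return canned[question]


pool = [Demo("2 + 3 * 4", "14"), Demo("10 - 2 * 3", "4")]
kept = select(augment(pool, toy_generator))
prompt = build_prompt(kept, "7 + 2 * 5")
print(prompt)
```

In this toy run the second demonstration is dropped because its chain ends in 24 while the known answer is 4. A filter of this kind only checks the final value, so a chain with faulty intermediate steps and a correct ending would still pass.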

What carries the argument

The Auto-CoT framework, which generates reasoning chains for examples and applies systematic selection to curate high-quality demonstrations for the in-context prompt.

If this is right

Accuracy rises on arithmetic, commonsense, and symbolic reasoning tasks when explicit steps appear in the prompt.
Manual construction of reasoning examples is no longer required for strong in-context performance.
New tasks can receive better reasoning support simply by running the generation and selection steps.
Models follow more reliable reasoning paths because the context now includes intermediate explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could lower reliance on human-written Chain-of-Thought examples across many tasks.
Combining Auto-CoT curation with other prompt techniques might produce further gains on hard problems.
Testing the same pipeline on domains outside the original benchmarks would check how far the gains extend.

Load-bearing premise

The automatic generation process creates accurate and relevant reasoning chains while the selection process reliably removes low-quality demonstrations without introducing new biases.

What would settle it

Experiments on the same reasoning tasks that show Auto-CoT producing equal or lower accuracy than standard in-context learning without added reasoning chains.

Figures

Figures reproduced from arXiv: 2605.17088 by Rui Chu.

**Figure 1.** Illustration of CoT enhancement for In-Context Learning. Adjacent text (Section 2, Prior Works): In-context learning was first widely noticed through few-shot learning (Brown et al., 2020b) and was later cast as mathematical functions for deeper, logic-level research into explainable performance, for which (Garg et al., 2022; Xie et al., 2021) presented a systematic investigation into transformers' in-context learning c…

**Figure 2.** In-Context Learning step scenario. Adjacent text (Section 3.3, Auto-Chain-of-Thought Implementation): We are trying to find $\arg\min_{y} \ell(y, x_{k+1})$, where $\ell$ is our MSE loss comparing the perturbed output with the ground truth at $y_{41} \mid x_{k+1}$: $L(\delta) = \ell\big(M_\theta(P + \delta), f(x_{k+1})\big)$, with an Auto-CoT strategy. First, we augment the training pool by generating $k$ different reasoning chains for each input-output pair in our linear function: $P = \{P_1, P_2, \ldots, P_k\}$…

**Figure 3.** Numeral MSE comparison. Adjacent text: … observed at 4-length context (Baseline: 676.819 → Auto-CoT: 535.041). This indicates that Auto-CoT enhances model accuracy by incorporating reasoning chains and selecting high-quality demonstrations. However, the relationship between context length and AUC shows non-linear behavior. Notably, Auto-CoT achieves its highest AUC (0.607) at 33-length context, suggesting that intermedia…
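The two MSE values quoted in the Figure 3 excerpt can be turned into an absolute and a relative reduction with simple arithmetic. The snippet below uses only the numbers reported there; the variable names are illustrative.

```python
# MSE values quoted in the Figure 3 excerpt (4-length context).
baseline_mse = 676.819
auto_cot_mse = 535.041

absolute_drop = baseline_mse - auto_cot_mse
relative_drop = absolute_drop / baseline_mse

print(f"absolute drop: {absolute_drop:.3f}")  # absolute drop: 141.778
print(f"relative drop: {relative_drop:.1%}")  # relative drop: 20.9%
```

A reduction of roughly one fifth still leaves an MSE above 500 at this context length, so the figure describes a relative improvement and not a small absolute error.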

**Figure 4.** Numeral Workflow Visualization. Adjacent text:

Algorithm 2: In-Context Learning with Auto-CoT for Linear Function Approximation

Input: Training data $D$ with dimension (40, 20); query set $x_{\text{query}}$

Output: Predicted value $\hat{y}_{41}$

Step 1: Augment Stage

begin

Initialize prompt pool $P = \{\}$;

for $i = 1$ to $K$ do

  Sample linear function $f_i(x) = w_i^\top x$ from $F$;

  Generate sequence $P_i = (x_1, f_i(x_1), \ldots, x_k, f_i(x_k))$;

  Generate reasoning chain …
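The visible part of the Augment Stage in the Algorithm 2 excerpt (sample a linear function, then generate a sequence of input-output pairs) might look like the following in plain Python. This is a sketch under stated assumptions: the excerpt does not say how weights or inputs are drawn, so standard Gaussian sampling is assumed here; reading the (40, 20) dimension as 40 examples of 20 features is also an assumption; all function names are hypothetical. The reasoning-chain step is omitted because the excerpt is cut off before describing it.

```python
import random
from typing import List, Tuple

Sequence = List[Tuple[List[float], float]]  # [(x_1, f(x_1)), ..., (x_k, f(x_k))]


def sample_linear_function(dim: int, rng: random.Random) -> List[float]:
    """Draw the weight vector w of f(x) = w^T x (Gaussian draw is an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def apply_linear(w: List[float], x: List[float]) -> float:
    """Evaluate f(x) = w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def generate_sequence(w: List[float], k: int, dim: int, rng: random.Random) -> Sequence:
    """Build one prompt sequence of k input-output pairs for a fixed w."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    return [(x, apply_linear(w, x)) for x in xs]


def build_prompt_pool(K: int, k: int = 40, dim: int = 20, seed: int = 0) -> List[Sequence]:
    """Initialize an empty pool and fill it with K sampled sequences."""
    rng = random.Random(seed)
    pool: List[Sequence] = []
    for _ in range(K):
        w = sample_linear_function(dim, rng)
        pool.append(generate_sequence(w, k, dim, rng))
    return pool


pool = build_prompt_pool(K=3)
print(len(pool), len(pool[0]), len(pool[0][0][0]))  # 3 40 20
```

Each entry in the pool corresponds to one $P_i$; the 41st input would be the query whose value $\hat{y}_{41}$ the model is asked to predict.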

Original abstract

Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper sketches a pipeline for auto-generating and filtering CoT demonstrations for ICL, but the abstract leaves the actual performance gains and chain quality unverified.

The one thing to know is that the author describes an end-to-end process that generates reasoning chains for existing input-output pairs, inserts those chains into the prompt, and then applies a selection filter to drop low-quality examples. This targets the known weakness that plain ICL demos give no intermediate steps on multi-step tasks. The second thing is that the write-up stays at the level of a framework outline without showing the numbers or controls that would let us judge whether the added reasoning is what drives any improvement.

The paper does a straightforward job of stating the motivation and outlining the three steps: automatic chain generation, prompt augmentation, and systematic selection. That combination is presented as the new piece relative to earlier automatic-CoT or self-consistency papers, and the selection step in particular could be a practical addition for people already tuning ICL prompts.

The soft spots are exactly where the stress-test note flags them. The abstract claims better performance across reasoning tasks but supplies no quantitative results, baselines, or error bars. More importantly, there is no audit of whether the generated chains are factually correct or merely fluent. Because the generator is presumably the same class of model whose reasoning we are trying to improve, any systematic error will be copied into the demonstrations. Without an ablation that isolates the effect of the reasoning content from the effect of the filter, or a human check on chain accuracy, it is hard to attribute gains to explicit intermediate guidance rather than to selection artifacts such as length or surface fluency.

This is the kind of incremental prompt-engineering note that people working on ICL for reasoning tasks might want to read for the selection idea. A reader who already runs experiments on multi-step benchmarks could test the pipeline quickly. It is not yet strong enough on its own to change practice, but the underlying problem is real and the proposed direction is reasonable. I would send it to peer review so the author can add the missing quantitative details and validation steps; the core thinking is clear enough to deserve that look.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the ACIL (Auto Chain of Thoughts for In-Context Learning) framework to enhance In-Context Learning (ICL) in large language models. It automatically generates Chain-of-Thought (CoT) reasoning chains for input-output examples, incorporates these structured explanations into the prompt, and employs a systematic selection process to eliminate irrelevant or low-quality demonstrations. The central claim is that this provides explicit intermediate reasoning guidance, leading to improved performance on complex reasoning tasks, as supported by experiments across multiple tasks.

Significance. If the results hold and the generated chains are shown to be faithful, this could be a significant contribution by automating the creation of high-quality CoT demonstrations for ICL, thereby improving reliability on multi-step reasoning without the need for manual intervention or model fine-tuning. The framework's emphasis on selection to ensure quality is a positive aspect, though its effectiveness depends on the robustness of the generation and filtering steps.

major comments (2)

[§3] §3 (Methodology): The automatic generation of reasoning chains lacks any independent validation such as human agreement rates, error typology, or comparison to gold-standard CoT. This is load-bearing for the central claim, as generation by the same class of LLM risks reproducing systematic errors, making it impossible to confirm that gains arise from explicit reasoning guidance rather than selection artifacts.
[§5] §5 (Experiments): No ablation isolates the effect of the reasoning content from the selection filter (e.g., no comparison to length- or fluency-based selection of the same input-output pairs). Without this, the headline improvement cannot be attributed specifically to 'explicit intermediate reasoning guidance' as claimed.

minor comments (2)

[Abstract] The abstract supplies no quantitative results, baselines, error bars, or task details, which hinders immediate assessment of the claimed improvements.
Clarify the relationship between the title acronym 'ACIL' and the 'Auto-CoT' terminology used in the abstract and body to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

Referee: [§3] §3 (Methodology): The automatic generation of reasoning chains lacks any independent validation such as human agreement rates, error typology, or comparison to gold-standard CoT. This is load-bearing for the central claim, as generation by the same class of LLM risks reproducing systematic errors, making it impossible to confirm that gains arise from explicit reasoning guidance rather than selection artifacts.

Authors: We agree that direct validation of the generated reasoning chains is important for supporting the claim that improvements stem from explicit intermediate reasoning. The current experiments rely on end-task performance as evidence, but this does not rule out artifacts from the LLM generator. In the revision we will add a dedicated analysis subsection that reports human agreement rates on a sampled subset of generated chains, provides an error typology, and compares a portion of the chains against available gold-standard CoT annotations. revision: yes

Referee: [§5] §5 (Experiments): No ablation isolates the effect of the reasoning content from the selection filter (e.g., no comparison to length- or fluency-based selection of the same input-output pairs). Without this, the headline improvement cannot be attributed specifically to 'explicit intermediate reasoning guidance' as claimed.

Authors: We accept that the existing comparisons do not fully isolate the contribution of the reasoning content from the selection mechanism. To address this, the revised manuscript will include a new ablation that applies the identical selection filter to the original input-output pairs (without generated reasoning) and reports the resulting performance. We will also add length- and fluency-based selection baselines for the same pairs to further clarify the source of gains. revision: yes
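The ablation promised in the response above can be laid out as a small harness that builds each comparison arm from the same demonstrations. This is a hedged sketch, not the experimental code of the paper: `run_ablation`, `strip_chains`, `longest_chains`, `answer_matches`, and `count_chain_chars` are hypothetical names, the quality filter is an assumed stand-in, "length" is taken to mean chain length, and the `evaluate` callable here is a placeholder metric and not task accuracy. A fluency-based arm is left out because it would need a language-model scorer.

```python
from typing import Callable, Dict, List, Tuple

Demo = Tuple[str, str, str]  # (question, gold answer, generated chain)


def strip_chains(demos: List[Demo]) -> List[Demo]:
    """Remove reasoning text so only the input-output pair remains."""
    return [(q, a, "") for q, a, _ in demos]


def longest_chains(demos: List[Demo], n: int) -> List[Demo]:
    """Length-based baseline: keep the n demos with the longest chains."""
    return sorted(demos, key=lambda d: len(d[2]), reverse=True)[:n]


def run_ablation(
    demos: List[Demo],
    quality_filter: Callable[[List[Demo]], List[Demo]],
    evaluate: Callable[[List[Demo]], float],
    n: int,
) -> Dict[str, float]:
    """Score every arm with the same evaluate() so differences trace to the arm."""
    kept = quality_filter(demos)[:n]
    arms = {
        "chains + quality filter": kept,
        "pairs only + same filter": strip_chains(kept),
        "pairs only + length selection": strip_chains(longest_chains(demos, n)),
        "pairs only, unfiltered": strip_chains(demos[:n]),
    }
    return {name: evaluate(arm) for name, arm in arms.items()}


def answer_matches(demos: List[Demo]) -> List[Demo]:
    """Assumed filter: the value after the last '=' must equal the gold answer."""
    return [d for d in demos if d[2].rsplit("=", 1)[-1].strip() == d[1]]


def count_chain_chars(arm: List[Demo]) -> float:
    """Placeholder metric (NOT accuracy): total reasoning characters in the arm."""
    return float(sum(len(chain) for _, _, chain in arm))


demos = [
    ("2 + 3 * 4", "14", "3 * 4 = 12, then 2 + 12 = 14"),
    ("10 - 2 * 3", "4", "10 - 2 = 8, then 8 * 3 = 24"),
    ("6 / 2 + 1", "4", "6 / 2 = 3, then 3 + 1 = 4"),
]
results = run_ablation(demos, answer_matches, count_chain_chars, n=2)
for name, score in results.items():
    print(name, score)
```

In a real run, `evaluate` would prompt the model with each arm and return accuracy on held-out queries. The comparison that matters for the referee's point is the first arm against the second: both use the same selected pairs, and only the reasoning text differs.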

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

full rationale

The paper describes an Auto-CoT framework that automatically generates reasoning chains for ICL demonstrations, augments prompts with intermediate steps, and applies a selection filter to remove low-quality examples. The central claim is supported by experiments showing performance gains on reasoning tasks. No equations, fitted parameters, or load-bearing self-citations appear in the abstract or described content that would make any result equivalent to its inputs by construction. The method is defined procedurally and then tested externally, leaving the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that LLM-generated reasoning steps can be treated as reliable without external verification.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 3 internal anchors

[1] Alfred V. Aho and Jeffrey D. Ullman. 1972.
[2] Publications Manual. 1983.
[3] Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243
[4] Galen Andrew and Jianfeng Gao. 2007. Scalable training of …
[5] Dan Gusfield. 1997.
[6] Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Computing Research Repository.
[7] Rie Kubota Ando and Tong Zhang. 2005. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.
[8] James W. Cooley and John W. Tukey. 1965. An algorithm for the machine calculation of complex …
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need.
[11] Dissecting chain-of-thought: Compositionality through in-context filtering and learning. Advances in Neural Information Processing Systems.
[15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Winte... Language Models are Few-Shot Learners.
[16] Recursive deep models for semantic compositionality over a sentiment treebank. 2013. In Proceedings of the 2013 conference on empirical methods in natural language processing.
[18] What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems.
[19] Language models are few-shot learners. Advances in neural information processing systems.
[20] Dogu Araci. 2019. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063
[21] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr... 2020.
[22] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020b. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901
[23] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. 2022. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598
[24] Sili Huang, Jifeng Hu, Hechang Chen, Lichao Sun, and Bo Yang. 2024. In-context decision transformer: Reinforcement learning via hierarchical chain-of-thought. arXiv preprint arXiv:2405.20692
[25] KaShun Shum, Shizhe Diao, and Tong Zhang. 2023. Automatic prompt augmentation and selection with chain-of-thought from labeled data. arXiv preprint arXiv:2302.12822
[26] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642
[27] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080
[28] Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493
