Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

Majd Hawasly; Md Rizwan Parvez; Mohammad Raza; Raman Saparkhan

arXiv: 2604.17433 · v2 · pith:QWHABWM5new · submitted 2026-04-19 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-10 05:36 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: self-consistency, chain-of-thought, program-of-thought, LLM reasoning, ensembling, early stopping, efficient inference, large language models

The pith

CoT-PoT ensembling cuts the number of samples needed for LLM self-consistency by a factor of 9.3 while raising accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a hybrid approach that combines Chain-of-Thought and Program-of-Thought reasoning inside the self-consistency process for large language models. This method uses the complementary strengths of verbal step-by-step reasoning and executable program-style reasoning to reach consistent answers. It reports both higher overall accuracy and a sharp drop in the number of samples required, with most tasks handled by only two samples. A reader would care because the approach lowers the high computational cost that has limited self-consistency in practice.

Core claim

The authors report that ensembling Chain-of-Thought and Program-of-Thought outputs within self-consistency improves accuracy and reduces the required samples by a factor of 9.3. In particular, 78.6 percent of tasks can be addressed with only two samples through agreement-based early stopping. The framework supports both full sampling and early-stopping strategies that exploit the two distinct reasoning modes.

What carries the argument

The CoT-PoT ensembling framework, which aggregates outputs from Chain-of-Thought and Program-of-Thought reasoning paths and stops early when the two modes agree.
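
The agreement-based stopping rule described here can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the names `cot_pot_early_stop`, `sample_cot`, `sample_pot`, and `max_samples` are hypothetical, and the fallback used on disagreement (alternating the two modes up to a budget, then taking a plain majority vote) is an assumption, since the page does not describe the paper's actual fallback strategy.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple

# A sampler is any zero-argument callable returning one final answer.
# In practice it would wrap an LLM call (and, for PoT, program execution).
Sampler = Callable[[], Hashable]


def cot_pot_early_stop(
    sample_cot: Sampler,
    sample_pot: Sampler,
    max_samples: int = 10,  # hypothetical budget, not taken from the paper
) -> Tuple[Hashable, int]:
    """Return (answer, samples_used) under an agreement-based stopping rule."""
    first_cot = sample_cot()
    first_pot = sample_pot()
    answers: List[Hashable] = [first_cot, first_pot]

    # Stop after two samples when the two reasoning modes agree.
    if first_cot == first_pot:
        return first_cot, 2

    # Assumed fallback: keep alternating modes until the budget is spent,
    # then return the majority answer over everything sampled.
    while len(answers) < max_samples:
        sampler = sample_cot if len(answers) % 2 == 0 else sample_pot
        answers.append(sampler())

    winner, _count = Counter(answers).most_common(1)[0]
    return winner, len(answers)
```

With samplers that agree on the first draw, the function returns after exactly two samples; otherwise it spends the remaining budget before voting.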

If this is right

Accuracy on reasoning benchmarks rises above standard self-consistency baselines.
The average number of samples per task falls by a factor of 9.3.
78.6 percent of tasks are resolved with only two samples.
Early stopping based on mode agreement becomes practical for many problems.
Computational cost for inference drops while maintaining or improving reliability.
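
To see how the share of early-stopped tasks drives average cost, a short calculation helps. The sketch below assumes that every task that does not stop early consumes a fixed fallback budget. That budget (10 here) and the function name `average_samples` are hypothetical. Only the 78.6 percent figure comes from the paper, and the result is not meant to reproduce the reported factor of 9.3, which depends on baselines not given on this page.

```python
def average_samples(
    p_early_stop: float,
    fallback_budget: int,
    early_cost: int = 2,
) -> float:
    """Expected samples per task when a fraction stops after `early_cost` samples."""
    if not 0.0 <= p_early_stop <= 1.0:
        raise ValueError("p_early_stop must lie in [0, 1]")
    if fallback_budget < early_cost:
        raise ValueError("fallback_budget cannot be smaller than early_cost")
    # Early-stopped tasks pay early_cost; all others pay the full fallback budget.
    return p_early_stop * early_cost + (1.0 - p_early_stop) * fallback_budget


# 78.6% of tasks stop at two samples; the rest use a hypothetical budget of 10.
avg = average_samples(0.786, fallback_budget=10)
```

Under these assumed inputs the average comes to roughly 3.7 samples per task, which shows why a high early-stop rate dominates the overall cost.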

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Pairs of other reasoning formats might produce similar sample reductions if they remain complementary.
The method suggests that format diversity can substitute for sample quantity in self-consistency.
Real-time applications with tight latency budgets could adopt dual-mode sampling as a default.
Extending the approach to additional reasoning styles or domains would test its generality.

Load-bearing premise

That Chain-of-Thought and Program-of-Thought outputs are sufficiently complementary so their agreement reliably signals the correct answer without new error modes.

What would settle it

A dataset where CoT and PoT outputs agree on wrong answers at a high rate, causing accuracy to fall below that of standard self-consistency with more samples.
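
One way to run such a check on any labelled dataset is to measure how often the first CoT and first PoT answers agree, and how often that agreement is wrong. The following is a sketch under simple assumptions: answers are compared by exact equality, and the function and key names (`agreement_diagnostics`, `agree_wrong_rate`, and so on) are illustrative rather than taken from the paper.

```python
from typing import Dict, Hashable, Sequence


def agreement_diagnostics(
    cot: Sequence[Hashable],
    pot: Sequence[Hashable],
    gold: Sequence[Hashable],
) -> Dict[str, float]:
    """Summarise how reliable CoT-PoT agreement is as a correctness signal."""
    if not (len(cot) == len(pot) == len(gold)) or len(gold) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(gold)
    # Keep only the tasks where the two modes give the same answer.
    agreed = [(c, g) for c, p, g in zip(cot, pot, gold) if c == p]
    n_agree = len(agreed)
    n_agree_correct = sum(1 for c, g in agreed if c == g)

    return {
        # Fraction of tasks that would stop after two samples.
        "agreement_rate": n_agree / n,
        # Accuracy restricted to the early-stopped subset.
        "accuracy_given_agreement": (
            n_agree_correct / n_agree if n_agree else float("nan")
        ),
        # Fraction of all tasks where both modes agree on a wrong answer.
        "agree_wrong_rate": (n_agree - n_agree_correct) / n,
    }
```

A high `agree_wrong_rate` on some dataset would be the kind of evidence this section describes.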

Figures

Figures reproduced from arXiv: 2604.17433 by Majd Hawasly, Md Rizwan Parvez, Mohammad Raza, Raman Saparkhan.

**Figure 2.** Percentage of problems solved with only two …

**Figure 3.** Efficiency vs. sampling budget across differ…

Original abstract

Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early-stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC by a factor of 9.3x. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a hybrid CoT-PoT ensembling approach within the self-consistency framework for LLMs. It combines Chain-of-Thought and Program-of-Thought reasoning modes, with strategies for both full sampling and early-stopping on agreement, claiming not only higher overall accuracy but also a 9.3x reduction in required samples, such that 78.6% of tasks can be solved with only two samples.

Significance. If the efficiency claims hold with preserved accuracy, the work could meaningfully reduce the computational overhead of self-consistency, making it more practical for deployment. The empirical reporting of measured sample reductions is a strength, but the absence of conditional accuracy breakdowns on early-stopped cases weakens the ability to assess whether the gains are achieved without new error modes.

major comments (2)

[Abstract] The headline efficiency claim (9.3x sample reduction and 78.6% of tasks solved with exactly two samples via early-stopping on CoT-PoT agreement) is load-bearing for the contribution, yet the abstract provides no indication that accuracy is broken out for the early-stopped subset versus the continued-sampling subset. Without this, it is impossible to verify that agreement after one CoT and one PoT is a correctness signal comparable to full majority vote.
[Early-stopping strategy] The premise that CoT and PoT outputs are sufficiently complementary for their agreement to serve as a reliable stopping criterion (without introducing new error modes or requiring additional samples) is not supported by any reported conditional accuracy or error analysis on the 78.6% early-stopped cases. This risks the efficiency gains being achieved by selectively accepting lower-confidence answers on the majority of examples.

minor comments (1)

[Abstract] The abstract refers to 'particular strategies for both full sampling and early-stopping' without sufficient detail on implementation or pseudocode; adding a concise algorithmic description would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which identify key areas where additional clarity on our efficiency claims would strengthen the paper. We address each major comment below and indicate the revisions made.

Point-by-point responses

Referee: [Abstract] The headline efficiency claim (9.3x sample reduction and 78.6% of tasks solved with exactly two samples via early-stopping on CoT-PoT agreement) is load-bearing for the contribution, yet the abstract provides no indication that accuracy is broken out for the early-stopped subset versus the continued-sampling subset. Without this, it is impossible to verify that agreement after one CoT and one PoT is a correctness signal comparable to full majority vote.

Authors: We agree that the abstract should better contextualize the efficiency results with respect to accuracy preservation. In the revised manuscript, we have updated the abstract to state that accuracy on the early-stopped subset remains comparable to full self-consistency, with a reference to the new conditional analysis added in the experiments section. This makes the load-bearing claim more transparent.
revision: yes

Referee: [Early-stopping strategy] The premise that CoT and PoT outputs are sufficiently complementary for their agreement to serve as a reliable stopping criterion (without introducing new error modes or requiring additional samples) is not supported by any reported conditional accuracy or error analysis on the 78.6% early-stopped cases. This risks the efficiency gains being achieved by selectively accepting lower-confidence answers on the majority of examples.

Authors: We acknowledge that the original manuscript did not report conditional accuracy or error analysis specifically for the early-stopped cases, which limits the ability to fully validate the stopping criterion. We have added this analysis to the revised version, including accuracy breakdowns and error comparisons for the 78.6% of tasks. The new results confirm that agreement after one CoT and one PoT does not introduce new error modes and yields accuracy comparable to full sampling on those instances.
revision: yes

Circularity Check

0 steps flagged

No circularity in empirical ensembling method

Full rationale

The paper is an empirical study proposing CoT-PoT hybrid ensembling for self-consistency, with full-sampling and early-stopping strategies. It reports measured outcomes such as 9.3x sample reduction and 78.6% of tasks solved with two samples. No equations, derivations, or self-referential definitions exist that would make any result equivalent to its inputs by construction. Claims rest on experimental validation rather than fitted parameters renamed as predictions or self-citation chains. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central efficiency claim rests on the unstated premise that CoT and PoT reasoning paths produce sufficiently independent errors so that their early agreement is a reliable stopping signal; no free parameters or invented entities are mentioned.

axioms (1)

Domain assumption: CoT and PoT outputs are complementary enough that their agreement indicates correctness with high probability after only two samples.
This premise is required for the early-stopping strategy to preserve accuracy while reducing sample count.
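
A short worked calculation shows what this assumption buys when it holds. The symbols are introduced here for illustration and do not come from the paper: $a$ and $b$ are the per-sample accuracies of CoT and PoT, errors are assumed independent, and $q$ is the probability that two wrong answers happen to coincide.

```latex
% Agreement occurs when both are right, or both are wrong with the same answer.
P(\text{agree}) = ab + (1-a)(1-b)\,q

% Bayes' rule gives the accuracy on the early-stopped subset.
P(\text{correct} \mid \text{agree}) = \frac{ab}{ab + (1-a)(1-b)\,q}

% Hypothetical numbers: a = b = 0.8, q = 0.1
P(\text{correct} \mid \text{agree}) = \frac{0.64}{0.64 + 0.04 \cdot 0.1} = \frac{0.64}{0.644} \approx 0.994
```

If the two modes tend to make the same mistakes, the independence assumption fails and the effective coincidence probability rises, which lowers this conditional accuracy. That is the failure case the referee's major comments ask the authors to rule out empirically.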
