Rationale-Augmented Ensembles in Language Models

Dale Schuurmans; Denny Zhou; Ed Chi; Jason Wei; Quoc Le; Xuezhi Wang

arxiv: 2207.00747 · v1 · pith:OCVMF4C6new · submitted 2022-07-02 · 💻 cs.CL

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Denny Zhou

classification 💻 cs.CL

keywords rationale-augmented · prompting · rationales · ensembles · output · performance · demonstrate · existing

Recent research has shown that rationales, or step-by-step chains of thought, can be used to improve performance in multi-step reasoning tasks. We reconsider rationale-augmented prompting for few-shot in-context learning, where (input -> output) prompts are expanded to (input, rationale -> output) prompts. For rationale-augmented prompting we demonstrate how existing approaches, which rely on manual prompt engineering, are subject to sub-optimal rationales that may harm performance. To mitigate this brittleness, we propose a unified framework of rationale-augmented ensembles, where we identify rationale sampling in the output space as the key component to robustly improve performance. This framework is general and can easily be extended to common natural language processing tasks, even those that do not traditionally leverage intermediate steps, such as question answering, word sense disambiguation, and sentiment analysis. We demonstrate that rationale-augmented ensembles achieve more accurate and interpretable results than existing prompting approaches--including standard prompting without rationales and rationale-based chain-of-thought prompting--while simultaneously improving interpretability of model predictions through the associated rationales.
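The ensemble idea in the abstract can be illustrated with a minimal sketch. It assumes that several (rationale, answer) pairs are sampled from the model for one rationale-augmented prompt and that the final answers are combined by plurality vote; the abstract does not specify the aggregation rule, so the vote is an assumption here. The names `build_prompt`, `rationale_ensemble`, and the `sample` callable are hypothetical and stand in for whatever model interface is actually used.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical sampler type: given a prompt, return one sampled
# (rationale, answer) pair from a language model.
Sampler = Callable[[str], Tuple[str, str]]


def build_prompt(exemplars: List[Tuple[str, str, str]], question: str) -> str:
    """Format (input, rationale -> output) exemplars, then the new question."""
    parts = [f"Q: {q}\nA: {r} The answer is {a}." for q, r, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def rationale_ensemble(sample: Sampler, prompt: str, num_samples: int = 5) -> str:
    """Sample several rationales and return the most frequent final answer.

    Assumption: answers are aggregated by plurality vote after light
    normalization (strip whitespace, lowercase). Rationales are discarded
    once their answers have been counted.
    """
    votes: Counter = Counter()
    for _ in range(num_samples):
        _rationale, answer = sample(prompt)
        votes[answer.strip().lower()] += 1
    # most_common(1) returns the answer with the highest vote count.
    return votes.most_common(1)[0][0]
```

In practice `sample` would wrap a call to a model decoded with a non-zero temperature, so that repeated calls yield different rationales for the same prompt.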

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

PAL: Program-aided Language Models
cs.CL 2022-11 conditional novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
cs.AI 2026-05 unverdicted novelty 7.0

REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.

Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL
cs.LG 2026-05 conditional novelty 7.0

A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL 2024-12 unverdicted novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

Measuring Faithfulness in Chain-of-Thought Reasoning
cs.AI 2023-07 conditional novelty 7.0

Chain-of-Thought reasoning in LLMs is often unfaithful: how much a model relies on it varies by task, and reliance decreases as models scale.

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback
cs.CL 2026-05 unverdicted novelty 6.0

A multi-agent LLM system discovers criteria such as Encouraging, Urgent, and Clear for surgical feedback and uses them to score 4.2k instances, outperforming prior content-based approaches in predicting trainee behavi...

The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
cs.CV 2026-04 unverdicted novelty 6.0

A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.

Beyond Fine-Tuning: In-Context Learning and Chain-of-Thought for Reasoned Distractor Generation
cs.CL 2026-04 unverdicted novelty 6.0

LLMs prompted with few-shot examples and rationales generate better reasoned distractors for MCQs than fine-tuned contrastive models across six benchmarks.

Large Language Models Can Self-Improve
cs.CL 2022-10 unverdicted novelty 6.0

A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
cs.CL 2026-06 unverdicted novelty 5.0

A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.

LLM Multi-Agent Systems: Challenges and Open Problems
cs.MA 2024-02 unverdicted novelty 2.0

The paper identifies inadequately addressed challenges in optimizing task allocation, fostering robust reasoning through debates, managing layered context, enhancing memory, and applying multi-agent systems to blockchain.

A Comprehensive Overview of Large Language Models
cs.CL 2023-07 unverdicted novelty 2.0

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.