Title resolution pending

2022 · arXiv 2206.02336

13 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Let's Verify Step by Step

cs.LG · 2023-05-31 · accept · novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, reaching 78% accuracy on a representative test subset; the work also uses active learning and releases an 800k step-label dataset.

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

cs.CL · 2023-05-06 · conditional · novelty 7.0

Plan-and-Solve prompting improves zero-shot LLM reasoning by first creating an explicit plan then executing subtasks, outperforming simple 'think step by step' prompts across ten datasets.

CodeT: Code Generation with Generated Tests

cs.CL · 2022-07-21 · conditional · novelty 7.0

CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.

Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

cs.MA · 2026-04-03 · unverdicted · novelty 6.0

HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

cs.CL · 2024-06-05 · conditional · novelty 6.0

OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

Reasoning with Language Model is Planning with World Model

cs.CL · 2023-05-24 · unverdicted · novelty 6.0

RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.

Multimodal Chain-of-Thought Reasoning in Language Models

cs.CL · 2023-02-02 · accept · novelty 6.0

Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.

Solving math word problems with process- and outcome-based feedback

cs.LG · 2022-11-25 · unverdicted · novelty 6.0

On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

cs.CL · 2022-10-17 · accept · novelty 6.0

Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

cs.CL · 2025-07-25 · unverdicted · novelty 5.0

Question interpretation diversity outperforms model diversity for LLM ensembling on binary QA tasks using majority voting.

AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning

cs.CL · 2024-10-17 · unverdicted · novelty 5.0

AdaSwitch improves small local LLM performance on reasoning tasks by adaptively switching to a large cloud LLM upon detected errors, sometimes matching cloud results with far less overhead.

PaLM 2 Technical Report

cs.CL · 2023-05-17 · unverdicted · novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

citing papers explorer

Showing 13 of 13 citing papers.

Let's Verify Step by Step cs.LG · 2023-05-31 · accept · none · ref 10
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, reaching 78% accuracy on a representative test subset; the work also uses active learning and releases an 800k step-label dataset.
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models cs.CL · 2023-05-06 · conditional · none · ref 1
Plan-and-Solve prompting improves zero-shot LLM reasoning by first creating an explicit plan then executing subtasks, outperforming simple 'think step by step' prompts across ten datasets.
CodeT: Code Generation with Generated Tests cs.CL · 2022-07-21 · conditional · none · ref 9
CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.
Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate cs.MA · 2026-04-03 · unverdicted · none · ref 16
HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision cs.CL · 2024-06-05 · conditional · none · ref 10
OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
Reasoning with Language Model is Planning with World Model cs.CL · 2023-05-24 · unverdicted · none · ref 116
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
Multimodal Chain-of-Thought Reasoning in Language Models cs.CL · 2023-02-02 · accept · none · ref 24
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
Solving math word problems with process- and outcome-based feedback cs.LG · 2022-11-25 · unverdicted · none · ref 25
On GSM8K, outcome-based supervision achieves similar final-answer error rates to process-based with less labeling, but process-based or learned reward models are needed to reach 3.4% reasoning error among correct solutions.
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them cs.CL · 2022-10-17 · accept · none · ref 16
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question cs.CL · 2025-07-25 · unverdicted · none · ref 3
Question interpretation diversity outperforms model diversity for LLM ensembling on binary QA tasks using majority voting.
AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning cs.CL · 2024-10-17 · unverdicted · none · ref 11
AdaSwitch improves small local LLM performance on reasoning tasks by adaptively switching to a large cloud LLM upon detected errors, sometimes matching cloud results with far less overhead.
PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 267
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 161
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
