Rethinking thinking tokens: LLMs as improvement operators. arXiv preprint arXiv:2510.01123, 2025
4 Pith papers cite this work.
citing papers explorer
- CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
  CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.
- OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation
  OpenDeepThink uses Bradley-Terry aggregation of LLM pairwise judgments to rank and evolve parallel reasoning traces, improving Gemini 3.1 Pro Codeforces Elo by 405 points over eight rounds.
- Stateful Reasoning via Insight Replay
  InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
- DeepPrune: Parallel Scaling without Inter-trace Redundancy
  DeepPrune prunes redundant parallel CoT traces via a judge model for equivalence prediction from partial traces plus online greedy clustering, delivering 65-88% token savings with accuracy within 3 points on AIME and GPQA benchmarks.
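The OpenDeepThink entry mentions Bradley-Terry aggregation of pairwise judgments. As a rough illustration of that general technique only (not the paper's implementation), the sketch below fits Bradley-Terry strengths from (winner, loser) pairs using the standard minorization-maximization update. The function name `bradley_terry`, the small pseudo-count used to keep strengths positive, and the toy comparison data are all assumptions made for this example.

```python
from collections import defaultdict


def bradley_terry(n_items, comparisons, iters=200, tol=1e-9, pseudo=1e-3):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Hypothetical helper: `pseudo` is a small pseudo-win added to every item
    so that an item with zero wins keeps a positive strength (an assumption
    of this sketch, not part of the classical model).
    """
    wins = [pseudo] * n_items
    pair_counts = defaultdict(float)  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[(min(winner, loser), max(winner, loser))] += 1.0

    strengths = [1.0] * n_items
    for _ in range(iters):
        # denom[i] = sum over opponents j of n_ij / (p_i + p_j)
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            share = n_ij / (strengths[i] + strengths[j])
            denom[i] += share
            denom[j] += share
        # MM update; items that were never compared keep their old strength.
        new = [
            wins[i] / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        # Rescale so the strengths sum to n_items (the scale is arbitrary).
        total = sum(new)
        new = [s * n_items / total for s in new]
        delta = max(abs(a - b) for a, b in zip(new, strengths))
        strengths = new
        if delta < tol:
            break
    return strengths


# Toy data: three candidate traces, each tuple is (winner, loser).
judgments = [(0, 1)] * 3 + [(1, 0)] + [(0, 2)] * 2 + [(1, 2)] * 2 + [(2, 1)]
scores = bradley_terry(3, judgments)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Fitting a global strength per trace, rather than counting raw wins, lets sparse and unevenly distributed comparisons still produce a single consistent ranking.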