pith. machine review for the scientific record.

arxiv: 2605.15177 · v1 · submitted 2026-05-14 · 💻 cs.AI

Recognition: no theorem link

OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation


Pith reviewed 2026-05-15 03:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords test-time compute · LLM reasoning · Bradley-Terry model · pairwise comparison · population-based selection · Codeforces benchmark · parallel reasoning

The pith

Bradley-Terry aggregation of LLM pairwise judgments improves reasoning by 405 Elo points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that scaling test-time compute by sampling many reasoning candidates in parallel runs into a selection problem, because single LLM judgments are noisy. It solves this by having the model compare random pairs of candidates, then aggregating those votes into a global ranking with the Bradley-Terry model. Top-ranked traces are kept and rewritten using the natural-language critiques from the comparisons, while the bottom quarter is dropped; the cycle repeats. On Codeforces problems this lifts Gemini 3.1 Pro by 405 effective Elo points after eight rounds. The same pipeline works on both weaker and stronger models and produces larger gains on domains with objective answers.

Core claim

OpenDeepThink maintains a population of reasoning traces. In each round the LLM judges random pairs, the Bradley-Terry model turns those votes into a single ranking, the top three-quarters are retained and mutated with the comparison critiques, and the bottom quarter is discarded. Eight such rounds raise Gemini 3.1 Pro's effective Codeforces Elo by 405 points while the method transfers across model sizes and concentrates its benefit in verifiable domains.
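The round structure described above can be sketched in a few lines of Python. `judge_pair` and `mutate` are hypothetical stand-ins for the LLM calls, and raw win counts stand in here for the full Bradley-Terry fit (which matters when candidates face different numbers of comparisons):

```python
import random

def opendeepthink_round(population, judge_pair, mutate, n_pairs):
    """One selection round, as a sketch: judge random pairs, rank by
    aggregated votes, keep and mutate the top three-quarters, discard
    the bottom quarter."""
    wins = {i: 0 for i in range(len(population))}
    critiques = {i: [] for i in range(len(population))}
    for _ in range(n_pairs):
        i, j = random.sample(range(len(population)), 2)
        winner, critique = judge_pair(population[i], population[j])
        wins[i if winner == 0 else j] += 1
        critiques[i].append(critique)      # both candidates see the critique
        critiques[j].append(critique)
    ranked = sorted(wins, key=wins.get, reverse=True)
    survivors = ranked[: (3 * len(population)) // 4]
    return [mutate(population[k], critiques[k]) for k in survivors]
```

With a population of eight, each round returns six mutated survivors; iterating this eight times reproduces the shape, though not the substance, of the reported pipeline.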

What carries the argument

Bradley-Terry aggregation converts noisy pairwise LLM comparisons into a global ranking that drives selection and critique-based mutation of the best reasoning traces.
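The aggregation step is a standard statistical fit. A self-contained sketch using the classic minorization-maximization (Zermelo/Hunter) updates, which is one common way to fit Bradley-Terry and not necessarily the paper's solver:

```python
def bradley_terry(n_items, comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs via
    MM updates: p_i <- W_i / sum_j n_ij / (p_i + p_j), then normalize."""
    wins = [0.0] * n_items
    n = [[0.0] * n_items for _ in range(n_items)]   # games played per pair
    for w, l in comparisons:
        wins[w] += 1
        n[w][l] += 1
        n[l][w] += 1
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = sum(n[i][j] / (p[i] + p[j]) for j in range(n_items) if j != i)
            new_p.append(wins[i] / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x * n_items / s for x in new_p]        # fix scale for identifiability
    return p  # higher strength = better candidate; sort descending to rank
```

Sorting candidates by the returned strengths gives the global ranking that drives selection.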

If this is right

  • The pipeline transfers to both weaker and stronger models without any retuning.
  • Gains concentrate in objectively verifiable domains and can reverse in subjective ones on the HLE benchmark.
  • The released CF-73 set of 73 Codeforces problems achieves 99% agreement with official verdicts under local evaluation.
  • Eight sequential rounds of parallel sampling plus Bradley-Terry selection produce the reported 405-point Elo lift in roughly 27 minutes of wall-clock time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection loop could be combined with depth-scaling methods that extend a single trace, rather than relying on breadth alone.
  • If pairwise consistency holds, the approach reduces reliance on expensive ground-truth verifiers for open-ended reasoning tasks.
  • The method might extend to non-code domains if the mutation step is adapted to preserve domain-specific constraints.
  • Curating small high-agreement benchmarks like CF-73 could become a standard way to measure progress on selection-based test-time scaling.

Load-bearing premise

LLM pairwise judgments remain consistent and unbiased across random pairs so that the resulting Bradley-Terry ranking reliably identifies higher-quality reasoning traces.
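This premise is checkable: inconsistent pairwise verdicts would show up as preference cycles. A hypothetical sketch (the function name and input format are illustrative, not from the paper) that counts cyclic triples:

```python
from itertools import combinations

def intransitivity_rate(beats, n_items):
    """Fraction of candidate triples forming a preference cycle
    (a > b > c > a). beats[(i, j)] is True when the judge preferred i
    over j; a high rate undermines any global ranking from these votes."""
    cycles = total = 0
    for a, b, c in combinations(range(n_items), 3):
        total += 1
        ab = beats[(a, b)]   # a preferred over b?
        bc = beats[(b, c)]
        ca = beats[(c, a)]
        # all-True is a>b>c>a; all-False is b>a, c>b, a>c: both are cycles
        if ab == bc == ca:
            cycles += 1
    return cycles / total
```

A rate near the one-quarter expected from coin-flip judging would falsify the premise; a rate near zero would support it.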

What would settle it

The claim would fail if, on a set of problems with known correct solutions, the Bradley-Terry ranking showed no positive correlation with actual correctness or with human Elo ratings.

read the original abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper introduces OpenDeepThink, a population-based test-time compute framework for LLMs. It samples multiple reasoning traces in parallel, elicits LLM judgments on random pairs, aggregates the votes into a global ranking via the Bradley-Terry model, preserves the top-ranked traces, mutates the top three-quarters using the natural-language critiques generated during pairwise comparison, and discards the bottom quarter. The central empirical claim is that this procedure raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points after eight sequential rounds (approximately 27 minutes wall-clock time). The method is reported to transfer across model scales without retuning and to produce gains concentrated in objectively verifiable domains on the multi-domain HLE benchmark. The authors also release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% agreement against official verdicts.

Significance. If the Bradley-Terry aggregation from LLM pairwise votes reliably selects higher-quality reasoning traces without access to ground-truth verifiers, the approach would constitute a meaningful contribution to parallel test-time scaling. It offers a concrete, verifier-free alternative to existing breadth-scaling techniques and demonstrates transferability. The release of the CF-73 dataset with high local-evaluation agreement is a clear positive for the community and enables future controlled studies.

major comments (4)
  1. [Results] Results section (Elo improvement claim): The +405 Elo gain for Gemini 3.1 Pro is presented without error bars, standard deviations across runs, or statistical significance tests. Given that the method relies on stochastic LLM judgments, this omission makes it impossible to determine whether the reported improvement is robust or could arise from variance in the Bradley-Terry aggregation.
  2. [Experiments] Experimental evaluation: No ablation is reported that isolates the contribution of Bradley-Terry ranking (for example, by replacing it with random selection, pointwise scoring, or majority vote while keeping the mutation step fixed). Without this control, it remains unclear whether the observed gains are driven by the pairwise aggregation or simply by the critique-based mutation applied to a population.
  3. [Dataset and Evaluation] Validation on CF-73: Although the dataset is released with 99% local-evaluation agreement, the manuscript provides no calibration of the resulting Bradley-Terry scores against the ground-truth verdicts, nor any inter-rater agreement statistics (e.g., Cohen's kappa or rank correlation) for the LLM judges on these problems. This leaves the core assumption—that LLM pairwise votes track true solution quality—unverified on the very benchmark used for the main claim.
  4. [HLE Experiments] HLE benchmark analysis: The statement that gains appear concentrated in objectively verifiable domains and reverse in subjective ones is presented without a per-domain breakdown, sample sizes per category, or statistical comparison. This weakens the supporting evidence for the method's selective benefit.
minor comments (2)
  1. [Implementation Details] The wall-clock timing of ~27 minutes is stated in the abstract but lacks a breakdown by number of LLM calls, batch sizes, or hardware specification in the main text.
  2. [Method] Notation for the Bradley-Terry parameters and the mutation operator could be introduced more formally with an explicit algorithm box or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and commit to revisions that strengthen the empirical claims without altering the core method or results.

read point-by-point responses
  1. Referee: [Results] Results section (Elo improvement claim): The +405 Elo gain for Gemini 3.1 Pro is presented without error bars, standard deviations across runs, or statistical significance tests. Given that the method relies on stochastic LLM judgments, this omission makes it impossible to determine whether the reported improvement is robust or could arise from variance in the Bradley-Terry aggregation.

    Authors: We agree that variability measures are essential given the stochastic nature of LLM judgments. We will rerun the Gemini 3.1 Pro experiments across five independent random seeds, report standard deviations and error bars on the Elo curves, and include a paired statistical test against the single-trace baseline in the revised Results section. revision: yes

  2. Referee: [Experiments] Experimental evaluation: No ablation is reported that isolates the contribution of Bradley-Terry ranking (for example, by replacing it with random selection, pointwise scoring, or majority vote while keeping the mutation step fixed). Without this control, it remains unclear whether the observed gains are driven by the pairwise aggregation or simply by the critique-based mutation applied to a population.

    Authors: We accept that an explicit ablation is needed to isolate the ranking mechanism. In the revised manuscript we will add a controlled ablation on CF-73 that keeps the mutation and population size fixed while replacing Bradley-Terry aggregation with (i) random selection, (ii) pointwise LLM scoring, and (iii) majority vote, reporting the resulting Elo trajectories for each variant. revision: yes

  3. Referee: [Dataset and Evaluation] Validation on CF-73: Although the dataset is released with 99% local-evaluation agreement, the manuscript provides no calibration of the resulting Bradley-Terry scores against the ground-truth verdicts, nor any inter-rater agreement statistics (e.g., Cohen's kappa or rank correlation) for the LLM judges on these problems. This leaves the core assumption—that LLM pairwise votes track true solution quality—unverified on the very benchmark used for the main claim.

    Authors: We will add a calibration subsection that computes Kendall's tau rank correlation between the Bradley-Terry scores and the ground-truth verdicts on CF-73, together with inter-judge agreement statistics (pairwise agreement rate and Kendall's tau) across multiple LLM judge runs. These metrics will be reported alongside the existing 99% expert-agreement figure. revision: yes

  4. Referee: [HLE Experiments] HLE benchmark analysis: The statement that gains appear concentrated in objectively verifiable domains and reverse in subjective ones is presented without a per-domain breakdown, sample sizes per category, or statistical comparison. This weakens the supporting evidence for the method's selective benefit.

    Authors: We will expand the HLE analysis with a table showing per-domain sample sizes, mean Elo deltas, and 95% confidence intervals, plus a statistical comparison (two-sample t-test) between the objectively verifiable and subjective domain groups to support the selectivity claim. revision: yes
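The Kendall's tau calibration promised in response 3 is inexpensive to compute. A minimal hand-rolled sketch using tau-a (the authors would more likely use tau-b, which corrects for ties in the binary verdicts):

```python
def kendall_tau_a(scores, verdicts):
    """Tau-a rank correlation between Bradley-Terry scores and 0/1
    verdicts. Positive values mean the ranking tracks true solution
    quality; tied pairs add zero to the numerator but stay in the
    denominator."""
    n = len(scores)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (scores[i] - scores[j]) * (verdicts[i] - verdicts[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```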

Circularity Check

0 steps flagged

No circularity: explicit algorithmic procedure using standard external Bradley-Terry model

full rationale

The paper presents OpenDeepThink as a procedural population-based algorithm: sample candidates, obtain LLM pairwise judgments on random pairs, aggregate via the standard Bradley-Terry model into a global ranking, preserve top candidates, mutate the top three-quarters using generated critiques, and discard the bottom quarter. This is an explicit algorithmic recipe rather than a mathematical derivation. Bradley-Terry is invoked as an established external statistical model for ranking from pairwise comparisons, with no equations shown that reduce the ranking or selection step to fitted parameters or self-referential definitions by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are described in the provided text. The +405 Elo improvement is reported as an empirical outcome on CF-73 and HLE, not as a prediction forced by the method's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework adds no new free parameters or invented entities but rests on the domain assumption that LLM pairwise comparisons produce usable ranking signals.

axioms (1)
  • domain assumption LLM pairwise judgments are sufficiently consistent to support reliable Bradley-Terry ranking of reasoning quality
    Invoked as the basis for selection; no additional validation steps are described in the abstract.

pith-pipeline@v0.9.0 · 5518 in / 1215 out tokens · 62893 ms · 2026-05-15T03:02:21.710282+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 11 internal anchors

  1. [1]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

  2. [2]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132.

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

  4. [4]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

  5. [5]

    Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

    Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, and Sanjeev Arora. Escaping the cognitive well: Efficient competition math with off-the-shelf models. arXiv preprint arXiv:2602.16793.

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

  7. [7]

    Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532.

  8. [8]

    Reasoning with Language Model is Planning with World Model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173.

  9. [9]

    Pacore: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

    Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, et al. Pacore: Learning to scale test-time compute with parallel coordinated reasoning. arXiv preprint arXiv:2601.05593.

  10. [10]

    Large Language Models Cannot Self-Correct Reasoning Yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.

  11. [11]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.

  12. [12]

    G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522.

  13. [13]

    Rethinking Thinking Tokens: LLMs as Improvement Operators

    Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, and Anirudh Goyal. Rethinking thinking tokens: Llms as improvement operators. arXiv preprint arXiv:2510.01123.

  14. [14]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.

  15. [15]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity's last exam. arXiv preprint arXiv:2501.14249.

  16. [16]

    Learning to Reason Across Parallel Samples for LLM Reasoning

    Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for llm reasoning. arXiv preprint arXiv:2506.09014.

  17. [17]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

  18. [18]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

  19. [19]

    ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-Time Compute

    Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. Parathinker: Native parallel thinking as a new paradigm to scale llm test-time compute. arXiv preprint arXiv:2509.04475.

  20. [20]

    Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724.

  21. [21]

    Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

    Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, and Beidi Chen. Multiverse: Your language models secretly decide how to parallelize and merge generation. arXiv preprint arXiv:2506.09991.

  22. [22]

    Population-Evolve: A Parallel Sampling and Evolutionary Method for LLM Math Reasoning

    Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, and Shuxin Zheng. Population-evolve: a parallel sampling and evolutionary method for llm math reasoning. arXiv preprint arXiv:2512.19081.

  23. [23]

    LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

    Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, et al. Livecodebench pro: How do olympiad medalists judge llms in competitive programming? arXiv preprint arXiv:2506.11928.

  24. [24]

    We report two scenarios

    centered loosely on the published rating of Gemini 3.1 Pro, optimizing the posterior with scipy.optimize.minimize_scalar over the bounded interval [1000, 5000]. We report two scenarios. For gen-0 pass@1, the per-problem likelihood is Binomial with n = 20 independent gen-0 samples and k accepted; this measures the rating implied by naive sampling. For the fina...
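The fitting procedure this fragment describes can be sketched as follows, assuming the standard 400-point logistic Elo curve. A plain grid search stands in for the bounded scipy.optimize.minimize_scalar call, and the problem ratings below are illustrative, not the paper's data:

```python
import math

def elo_from_pass_rates(problems, lo=1000, hi=5000, step=1.0):
    """Maximum-likelihood Elo over a bounded interval. Each problem
    contributes a Binomial(n, k) likelihood whose success probability
    comes from the logistic Elo curve. problems: (rating, n, k) triples."""
    def neg_log_lik(elo):
        nll = 0.0
        for rating, n, k in problems:
            p = 1.0 / (1.0 + 10.0 ** ((rating - elo) / 400.0))
            p = min(max(p, 1e-9), 1.0 - 1e-9)   # guard against log(0)
            nll -= k * math.log(p) + (n - k) * math.log(1.0 - p)
        return nll
    grid = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return min(grid, key=neg_log_lik)
```

A solver at true rating 2000 that passes half the samples on a 2000-rated problem, most on an easier one, and few on a harder one recovers an estimate near 2000.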