OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation
Pith reviewed 2026-05-20 20:47 UTC · model grok-4.3
The pith
OpenDeepThink improves LLM reasoning performance by aggregating pairwise LLM judgments into Bradley-Terry rankings to select and evolve the best candidates from parallel samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenDeepThink is a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. This raises Gemini 3.1 Pro's effective Codeforces Elo by 405 points in eight sequential LLM-call rounds.
What carries the argument
Bradley-Terry aggregation of LLM pairwise comparison votes, which converts noisy head-to-head judgments into a global ranking that drives both selection and natural-language mutation of candidate reasoning traces.
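As a rough illustration of this aggregation step, the sketch below fits Bradley-Terry strengths to a list of pairwise votes with the textbook minorization-maximization update. It is not the authors' implementation: the function names, the `(winner, loser)` vote format, and the small `prior` pseudo-count (a virtual opponent of strength 1 that keeps undefeated or winless candidates finite) are assumptions made for this example.

```python
from collections import defaultdict


def fit_bradley_terry(n_items, votes, iters=500, prior=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Assumption of this sketch: every item also plays 2 * prior virtual games
    against an opponent of fixed strength 1.0, winning `prior` of them.
    """
    wins = [0.0] * n_items
    games = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        # Denominator of the MM update: sum over opponents of n_ij / (p_i + p_j).
        denom = [0.0] * n_items
        for (i, j), n_ij in games.items():
            share = n_ij / (strengths[i] + strengths[j])
            denom[i] += share
            denom[j] += share
        strengths = [
            (wins[i] + prior) / (denom[i] + 2.0 * prior / (strengths[i] + 1.0))
            for i in range(n_items)
        ]
    return strengths


def rank_candidates(n_items, votes):
    """Return candidate indices ordered from strongest to weakest."""
    strengths = fit_bradley_terry(n_items, votes)
    return sorted(range(n_items), key=lambda i: strengths[i], reverse=True)


# Toy votes: 0 usually beats 1, 1 usually beats 2, 0 always beats 2.
votes = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)] + [(0, 2)] * 4
print(rank_candidates(3, votes))  # prints [0, 1, 2]
```

The fitted strengths tolerate a few noisy or contradictory votes, which is the property the argument relies on; whether LLM judges are accurate enough for that to help is the open question raised below.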
If this is right
- The same Bradley-Terry selection and mutation loop transfers to other models without parameter changes.
- Gains concentrate in objectively verifiable domains and can reverse in subjective ones.
- Eight LLM-call rounds suffice to produce a 405-point Elo lift on Codeforces-style problems.
- A new 73-problem benchmark with International Grandmaster annotations and 99 percent local-evaluation agreement is now available for further testing.
Where Pith is reading between the lines
- The approach could be stacked with depth-scaling techniques that extend single traces rather than breadth-scaling multiple ones.
- If pairwise aggregation proves reliable, similar self-comparison loops might help in open-ended generation tasks where no automatic scorer exists.
- The released CF-73 set offers a compact, high-agreement test bed for measuring whether any ranking method aligns with expert human judgment.
Load-bearing premise
LLM pairwise judgments, once aggregated via Bradley-Terry, produce rankings accurate enough to guide effective selection and natural-language mutation without access to ground-truth verifiers.
What would settle it
Run the full eight-round OpenDeepThink pipeline on the CF-73 problems while also scoring every final candidate against the official Codeforces verdicts; if the Bradley-Terry top-ranked traces do not solve substantially more problems than randomly chosen or pointwise-scored traces, the central claim is false.
Original abstract
Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenDeepThink, a population-based test-time compute framework that samples multiple reasoning traces in parallel, aggregates LLM pairwise judgments via the Bradley-Terry model to produce a global ranking, preserves top-ranked candidates, mutates the top three-quarters using natural-language critiques from the comparisons, and discards the bottom quarter. It reports that this procedure raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points after eight sequential rounds on the newly released CF-73 benchmark (73 expert-rated problems with 99% local-verifier agreement) and shows transfer to other models with gains concentrated in objectively verifiable domains.
Significance. If the central performance claim is robust, the work supplies a concrete empirical demonstration that breadth scaling via pairwise ranking can improve LLM reasoning on verifiable tasks without ground-truth verifiers. The release of CF-73 and the observation that gains reverse in subjective domains are useful contributions. The absence of retuning across model strengths is a practical strength.
major comments (1)
- [§4] Experiments on CF-73: The manuscript reports the end-to-end +405 Elo gain but provides no direct measurement (e.g., pass-rate or verifier-agreement curves) of whether BT-ranked candidates exhibit higher correctness than random or pointwise baselines at intermediate rounds. This correlation is load-bearing for the claim that Bradley-Terry aggregation, rather than mutation volume or sampling alone, drives the improvement.
minor comments (2)
- The abstract and experimental section omit error bars, ablation tables, and the precise experimental protocol (temperature, number of pairs per round, mutation prompt template). Adding these would allow readers to assess reproducibility.
- [§3] Notation for the Bradley-Terry aggregation (e.g., how vote counts are converted to scores and how the top-three-quarters cutoff is applied) should be stated explicitly with a short equation or pseudocode in §3.
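For orientation, the standard textbook form of the Bradley-Terry model that such notation would presumably follow is given below. This is not taken from the manuscript: the symbols $\pi_i$ (candidate strength), $w_{ij}$ (number of votes preferring candidate $i$ over $j$), $N$ (population size), and in particular the ceiling-based cutoff in the last line are assumptions of this sketch.

```latex
\begin{align}
  P(i \succ j) &= \frac{\pi_i}{\pi_i + \pi_j}, \qquad \pi_i > 0, \\
  \ell(\pi) &= \sum_{i \neq j} w_{ij} \,\log \frac{\pi_i}{\pi_i + \pi_j}, \\
  \pi_i^{(t+1)} &= \frac{\sum_{j \neq i} w_{ij}}
      {\sum_{j \neq i} \dfrac{w_{ij} + w_{ji}}{\pi_i^{(t)} + \pi_j^{(t)}}}, \\
  S &= \bigl\{\, i : \operatorname{rank}(i) \le \lceil 3N/4 \rceil \,\bigr\}.
\end{align}
```

The third line is the usual minorization-maximization update for maximizing $\ell$; the fourth is one possible reading of the "top three quarters" rule, with candidates ranked by decreasing $\pi_i$.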
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the practical strengths of the framework and the utility of the CF-73 release. We address the major comment on intermediate measurements below and will incorporate the requested analyses in the revised manuscript.
Point-by-point responses
- Referee: [§4] Experiments on CF-73: The manuscript reports the end-to-end +405 Elo gain but provides no direct measurement (e.g., pass-rate or verifier-agreement curves) of whether BT-ranked candidates exhibit higher correctness than random or pointwise baselines at intermediate rounds. This correlation is load-bearing for the claim that Bradley-Terry aggregation, rather than mutation volume or sampling alone, drives the improvement.
Authors: We agree that direct measurements of correctness at intermediate rounds would help isolate Bradley-Terry aggregation as the primary driver. The current manuscript focuses on the cumulative +405 Elo improvement and the transfer results to establish overall efficacy. In the revised §4 we will add pass-rate curves and verifier-agreement metrics, computed post hoc against the CF-73 local evaluator (which agrees with the official verdict 99% of the time), comparing BT-selected candidates against random-sampling and pointwise LLM-scoring baselines at each of the eight rounds. These can be derived from the existing experimental traces without new model calls.
Revision: yes
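The kind of post-hoc comparison described in this response could be tabulated along the following lines. This is only a sketch: the record layout (`round`, `strategy`, `passed`) and the strategy labels are hypothetical and do not reflect the authors' actual trace format.

```python
from collections import defaultdict


def pass_rate_by_round(records):
    """Aggregate pass rates per (round, strategy).

    Each record is a dict with hypothetical keys:
      'round'    - generation index, 1..8
      'strategy' - e.g. 'bt', 'random', 'pointwise'
      'passed'   - True if the local evaluator accepted the candidate
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, attempted]
    for rec in records:
        key = (rec["round"], rec["strategy"])
        totals[key][0] += int(rec["passed"])
        totals[key][1] += 1
    return {key: passed / attempted for key, (passed, attempted) in totals.items()}


# Made-up records purely to show the call shape.
records = [
    {"round": 1, "strategy": "bt", "passed": True},
    {"round": 1, "strategy": "bt", "passed": False},
    {"round": 1, "strategy": "random", "passed": False},
    {"round": 1, "strategy": "random", "passed": False},
]
rates = pass_rate_by_round(records)
for (rnd, strategy), rate in sorted(rates.items()):
    print(rnd, strategy, rate)
```

Plotting one curve per strategy across rounds would show whether BT selection is ahead of the baselines before the final round, which is what the referee's comment asks for.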
Circularity Check
No circularity: empirical pipeline with independent experimental validation
Full rationale
The paper presents OpenDeepThink as an empirical test-time scaling procedure: sample candidates, obtain LLM pairwise judgments, aggregate via standard Bradley-Terry model into a ranking, preserve top candidates, mutate the top three-quarters using the produced critiques, and discard the bottom quarter. Performance is measured by end-to-end Elo gains on the CF-73 benchmark, which supplies independent local verifiers with 99% agreement to official verdicts. No equations, fitted parameters, or self-citations are shown that would make the reported +405 Elo gain a tautological consequence of the method's own construction. The central mechanism (BT aggregation guiding selection and mutation) remains falsifiable against ground-truth correctness and does not reduce to renaming or self-definition.
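The selection bookkeeping in that procedure can be sketched as follows. The helper names, the number of preserved elites (`n_elite`), the choice to sample pairs without replacement, and the ceiling-based cutoff are all assumptions of this example; the source states only that random pairs are judged, top-ranked candidates are preserved, the top three quarters are mutated, and the bottom quarter is discarded.

```python
import itertools
import math
import random


def sample_pairs(candidate_ids, n_pairs, rng):
    """Draw distinct unordered candidate pairs to send to the LLM judge."""
    all_pairs = list(itertools.combinations(candidate_ids, 2))
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))


def split_population(ranked_ids, n_elite=1, survive_frac=0.75):
    """Split a best-first ranking into (elites, to_mutate, discarded)."""
    n_survive = math.ceil(len(ranked_ids) * survive_frac)
    survivors = ranked_ids[:n_survive]
    return survivors[:n_elite], survivors, ranked_ids[n_survive:]


rng = random.Random(0)  # fixed seed so the example is reproducible
pairs = sample_pairs(range(8), 12, rng)

# Pretend a ranking step has already ordered eight candidates best-first.
elites, to_mutate, discarded = split_population(list("ABCDEFGH"))
print(elites, to_mutate, discarded)
# prints ['A'] ['A', 'B', 'C', 'D', 'E', 'F'] ['G', 'H']
```

In a full loop, the judged pairs would feed the ranking step, and the critiques attached to each comparison would be passed to the mutation prompt for every candidate in `to_mutate`.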
Axiom & Free-Parameter Ledger
free parameters (2)
- number of sequential rounds
- mutation fraction
axioms (1)
- domain assumption: LLM pairwise comparisons can be aggregated via Bradley-Terry into a global ranking that is more reliable than pointwise scoring.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
centered loosely on the published rating of Gemini 3.1 Pro, optimizing the posterior with scipy.optimize.minimize_scalar over the bounded interval [1000, 5000]. We report two scenarios. For gen-0 pass@1, the per-problem likelihood is Binomial with n = 20 independent gen-0 samples and k accepted; this measures the rating implied by naive sampling. For the fina...
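A minimal sketch of this kind of rating estimate is given below, under several assumptions that the fragment does not confirm: a Gaussian prior, the standard Elo logistic curve for the per-sample solve probability, and made-up problem ratings and prior values. To keep the example dependency-free, a hand-written golden-section search stands in for SciPy's bounded scalar minimizer.

```python
import math


def solve_prob(rating, difficulty):
    """Standard Elo expected score (assumed link between rating and success)."""
    return 1.0 / (1.0 + 10.0 ** ((difficulty - rating) / 400.0))


def neg_log_posterior(rating, problems, prior_mean, prior_sd, n=20):
    """Gaussian prior plus Binomial(n, p) likelihood per problem.

    `problems` is a list of (difficulty, k_accepted) pairs. The binomial
    coefficient is dropped because it does not depend on `rating`.
    """
    value = 0.5 * ((rating - prior_mean) / prior_sd) ** 2
    for difficulty, k in problems:
        p = min(max(solve_prob(rating, difficulty), 1e-12), 1.0 - 1e-12)
        value -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return value


def golden_section_min(f, lo, hi, tol=1e-3):
    """Bounded 1-D minimizer for a unimodal function (stand-in for SciPy)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


def map_rating(problems, prior_mean, prior_sd, n=20, lo=1000.0, hi=5000.0):
    """Maximum a posteriori rating over the bounded interval [lo, hi]."""
    objective = lambda r: neg_log_posterior(r, problems, prior_mean, prior_sd, n)
    return golden_section_min(objective, lo, hi)


# Hypothetical data: (problem rating, accepted samples out of 20).
problems = [(1800, 17), (2200, 9), (2600, 3)]
estimate = map_rating(problems, prior_mean=2400.0, prior_sd=400.0)
print(round(estimate))
```

The objective is convex in the rating under these assumptions (a logistic likelihood plus a Gaussian prior), so a bracketing search over the bounded interval is sufficient.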