
arxiv: 2605.15177 · v2 · pith:EYH5ATMUnew · submitted 2026-05-14 · 💻 cs.AI

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

Pith reviewed 2026-05-20 20:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords test-time compute scaling · LLM reasoning · Bradley-Terry model · pairwise comparison · parallel candidate sampling · code generation · population-based evolution · verifiable benchmarks

The pith

OpenDeepThink improves LLM reasoning performance by aggregating pairwise LLM judgments into Bradley-Terry rankings to select and evolve the best candidates from parallel samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenDeepThink as a way to scale test-time compute breadth by generating multiple reasoning candidates in parallel and then ranking them without any external verifier. It has the LLM compare random pairs of candidates and aggregates those votes through the Bradley-Terry model to produce a global ranking. Top-ranked traces are kept, the upper three-quarters receive natural-language mutations drawn from the comparison critiques, and the bottom quarter is dropped. Eight rounds of this process raise Gemini 3.1 Pro's effective Codeforces Elo by 405 points in roughly 27 minutes of wall-clock time. The same pipeline works on both weaker and stronger models and shows larger gains on domains with clear right-or-wrong answers.

Core claim

OpenDeepThink is a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. This raises Gemini 3.1 Pro's effective Codeforces Elo by 405 points in eight sequential LLM-call rounds.
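The selection step of one generation can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the number of preserved candidates (`n_elite`), and the number of judged pairs are hypothetical, since the summary above states only the three-quarters and one-quarter split.

```python
import random


def sample_pairs(n_candidates, n_pairs, rng):
    """Draw distinct random unordered pairs of candidate indices for the judge."""
    all_pairs = [
        (i, j)
        for i in range(n_candidates)
        for j in range(i + 1, n_candidates)
    ]
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))


def partition_by_rank(ranking, n_elite=1):
    """Split a best-first ranking into preserved, mutated, and discarded sets.

    n_elite is an assumed parameter; the source does not say how many
    top-ranked candidates are preserved.
    """
    cutoff = (3 * len(ranking)) // 4  # top three quarters are mutated
    return {
        "preserved": ranking[:n_elite],
        "mutated": ranking[:cutoff],
        "discarded": ranking[cutoff:],  # bottom quarter is dropped
    }


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_pairs(8, 12, rng))
    print(partition_by_rank([5, 2, 7, 0, 1, 3, 6, 4], n_elite=2))
```

In this sketch the preserved candidates also appear in the mutated set, which is one reading of the description. How the population is refilled after the bottom quarter is discarded is not specified in the text above.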

What carries the argument

Bradley-Terry aggregation of LLM pairwise comparison votes, which converts noisy head-to-head judgments into a global ranking that drives both selection and natural-language mutation of candidate reasoning traces.
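To make the aggregation step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise votes with the standard minorization-maximization update. The model assumes P(i beats j) = p_i / (p_i + p_j). The function names and the smoothing pseudo-count are assumptions made for illustration; the paper's actual fitting procedure is not given in this summary.

```python
def fit_bradley_terry(n_items, comparisons, iters=200, pseudo=0.1):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    pseudo adds a small smoothing count to every ordered pair so that
    unbeaten or never-compared candidates still get finite strengths
    (an assumed regularizer, not taken from the paper).
    """
    if n_items < 2:
        return [1.0] * n_items
    # wins[i][j] = (smoothed) number of times i beat j
    wins = [
        [0.0 if i == j else pseudo for j in range(n_items)]
        for i in range(n_items)
    ]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_items)
                if j != i
            )
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x / norm for x in new_p]  # strengths are normalized to sum to 1
    return p


def ranking_from_strengths(p):
    """Return candidate indices ordered from strongest to weakest."""
    return sorted(range(len(p)), key=lambda i: p[i], reverse=True)


if __name__ == "__main__":
    votes = [(0, 1)] * 3 + [(1, 2)] * 3 + [(0, 2)] * 2 + [(2, 1)]
    strengths = fit_bradley_terry(3, votes)
    print(ranking_from_strengths(strengths))
```

Because every vote feeds a single global fit, a candidate that was never compared directly with another can still be ordered relative to it through shared opponents, which is what lets a random subset of pairs stand in for a full round-robin.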

If this is right

  • The same Bradley-Terry selection and mutation loop transfers to other models without parameter changes.
  • Gains concentrate in objectively verifiable domains and can reverse in subjective ones.
  • Eight sequential LLM-call rounds suffice to produce a 405-point Elo lift for Gemini 3.1 Pro on Codeforces problems.
  • A new 73-problem benchmark with International Grandmaster annotations and 99 percent local-evaluation agreement is now available for further testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be stacked with depth-scaling techniques that extend single traces rather than breadth-scaling multiple ones.
  • If pairwise aggregation proves reliable, similar self-comparison loops might help in open-ended generation tasks where no automatic scorer exists.
  • The released CF-73 set offers a compact, high-agreement test bed for measuring whether any ranking method aligns with expert human judgment.

Load-bearing premise

LLM pairwise judgments, once aggregated via Bradley-Terry, produce rankings accurate enough to guide effective selection and natural-language mutation without access to ground-truth verifiers.

What would settle it

Run the full eight-round OpenDeepThink pipeline on the CF-73 problems while also scoring every final candidate against the official Codeforces verdicts; if the Bradley-Terry top-ranked traces do not solve substantially more problems than randomly chosen or pointwise-scored traces, the central claim is false.
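The comparison described above can be sketched as a small scoring routine. It assumes a hypothetical data layout: one list per problem holding the correctness flag of each final candidate, ordered best-first by Bradley-Terry rank. Only the random-pick baseline is shown; a pointwise-scored baseline would need its own ordering.

```python
def compare_selection(problems):
    """Return (top-ranked solve rate, expected solve rate of a uniform random pick).

    problems: list of lists of bools, one inner list per problem, where
    position 0 is the Bradley-Terry top-ranked candidate.
    """
    n = len(problems)
    top_ranked = sum(1 for flags in problems if flags[0]) / n
    random_pick = sum(sum(flags) / len(flags) for flags in problems) / n
    return top_ranked, random_pick


if __name__ == "__main__":
    # Illustrative flags only, not results from the paper.
    example = [
        [True, False, False, False],
        [False, True, True, False],
        [True, True, False, False],
    ]
    top, rand = compare_selection(example)
    print(f"top-ranked: {top:.3f}  random pick: {rand:.3f}")
```

If the first number does not clearly exceed the second across the benchmark, the ranking is adding little beyond sampling.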

Original abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces OpenDeepThink, a population-based test-time compute framework that samples multiple reasoning traces in parallel, aggregates LLM pairwise judgments via the Bradley-Terry model to produce a global ranking, preserves top-ranked candidates, mutates the top three-quarters using natural-language critiques from the comparisons, and discards the bottom quarter. It reports that this procedure raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points after eight sequential rounds on the newly released CF-73 benchmark (73 expert-rated problems with 99% local-verifier agreement) and shows transfer to other models with gains concentrated in objectively verifiable domains.

Significance. If the central performance claim is robust, the work supplies a concrete empirical demonstration that breadth scaling via pairwise ranking can improve LLM reasoning on verifiable tasks without ground-truth verifiers. The release of CF-73 and the observation that gains reverse in subjective domains are useful contributions. The absence of retuning across model strengths is a practical strength.

major comments (1)
  1. §4 (Experiments on CF-73): The manuscript reports the end-to-end +405 Elo gain but provides no direct measurement (e.g., pass-rate or verifier-agreement curves) of whether BT-ranked candidates exhibit higher correctness than random or pointwise baselines at intermediate rounds. This correlation is load-bearing for the claim that Bradley-Terry aggregation, rather than mutation volume or sampling alone, drives the improvement.
minor comments (2)
  1. The abstract and experimental section omit error bars, ablation tables, and the precise experimental protocol (temperature, number of pairs per round, mutation prompt template). Adding these would allow readers to assess reproducibility.
  2. [§3] Notation for the Bradley-Terry aggregation (e.g., how vote counts are converted to scores and how the top-three-quarters cutoff is applied) should be stated explicitly with a short equation or pseudocode in §3.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical strengths of the framework and the utility of the CF-73 release. We address the major comment on intermediate measurements below and will incorporate the requested analyses in the revised manuscript.

Point-by-point responses
  1. Referee: §4 (Experiments on CF-73): The manuscript reports the end-to-end +405 Elo gain but provides no direct measurement (e.g., pass-rate or verifier-agreement curves) of whether BT-ranked candidates exhibit higher correctness than random or pointwise baselines at intermediate rounds. This correlation is load-bearing for the claim that Bradley-Terry aggregation, rather than mutation volume or sampling alone, drives the improvement.

    Authors: We agree that direct measurements of correctness at intermediate rounds would strengthen the isolation of Bradley-Terry aggregation as the primary driver. The current manuscript focuses on the cumulative +405 Elo improvement and transfer results to establish overall efficacy. In the revised §4 we will add pass-rate curves and verifier-agreement metrics, computed post hoc against the CF-73 local evaluator (99% agreement with official verdicts), comparing BT-selected candidates against random-sampling and pointwise LLM-scoring baselines at each of the eight rounds. These can be derived from the existing experimental traces without new model calls.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent experimental validation

Full rationale

The paper presents OpenDeepThink as an empirical test-time scaling procedure: sample candidates, obtain LLM pairwise judgments, aggregate via standard Bradley-Terry model into a ranking, preserve top candidates, mutate the top three-quarters using the produced critiques, and discard the bottom quarter. Performance is measured by end-to-end Elo gains on the CF-73 benchmark, which supplies independent local verifiers with 99% agreement to official verdicts. No equations, fitted parameters, or self-citations are shown that would make the reported +405 Elo gain a tautological consequence of the method's own construction. The central mechanism (BT aggregation guiding selection and mutation) remains falsifiable against ground-truth correctness and does not reduce to renaming or self-definition.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework depends on the reliability of LLM pairwise judgments and the effectiveness of natural-language mutation from comparison critiques; no explicit free parameters beyond the reported 8 rounds and 3/4 mutation fraction are stated.

free parameters (2)
  • number of sequential rounds
    Fixed at eight for the reported Gemini experiment.
  • mutation fraction
    Top three-quarters of ranked candidates are mutated.
axioms (1)
  • domain assumption: LLM pairwise comparisons can be aggregated via Bradley-Terry into a global ranking that is more reliable than pointwise scoring.
    Invoked to justify the selection step in the absence of ground-truth verifiers.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques

What do these tags mean?

  • matches
    The paper's claim is directly supported by a theorem in the formal canon.
  • supports
    The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends
    The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses
    The paper appears to rely on the theorem as machinery.
  • contradicts
    The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear
    Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 14 internal anchors

  1. [1]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787,

  2. [2]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132,

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  4. [4]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261,

  5. [5]

    Escaping the cognitive well: Efficient competition math with off-the-shelf models

    Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, and Sanjeev Arora. Escaping the cognitive well: Efficient competition math with off-the-shelf models. arXiv preprint arXiv:2602.16793,

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  7. [7]

    EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532,

  8. [8]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173,

  9. [9]

    Pacore: Learning to scale test-time compute with parallel coordinated reasoning

    Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, et al. Pacore: Learning to scale test-time compute with parallel coordinated reasoning. arXiv preprint arXiv:2601.05593, 2026.

  10. [10]

    Large Language Models Cannot Self-Correct Reasoning Yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798,

  11. [11]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,

  12. [12]

    G-eval: Nlg evaluation using gpt-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522,

  13. [13]

    Rethinking thinking tokens: Llms as improvement operators

    Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, and Anirudh Goyal. Rethinking thinking tokens: Llms as improvement operators. arXiv preprint arXiv:2510.01123, 2025.

  14. [14]

    Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution

    Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang, Coleman Hooper, Yuezhou Hu, Rishabh Tiwari, Jue Wang, Harman Singh, et al. Squeeze evolve: Unified multi-model orchestration for verifier-free evolution. arXiv preprint arXiv:2604.07725,

  15. [15]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131,

  16. [16]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249,

  17. [17]

    Learning to reason across parallel samples for llm reasoning

    Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for llm reasoning. arXiv preprint arXiv:2506.09014,

  18. [18]

    Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, et al. V1: Unifying generation and self-verification for parallel reasoners. arXiv preprint arXiv:2603.04304,

  19. [19]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,

  20. [20]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

  21. [21]

    Parathinker: Native parallel thinking as a new paradigm to scale llm test-time compute

    Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. Parathinker: Native parallel thinking as a new paradigm to scale llm test-time compute. arXiv preprint arXiv:2509.04475,

  22. [22]

    Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724,

  23. [23]

    Multiverse: Your language models secretly decide how to parallelize and merge generation

    Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, and Beidi Chen. Multiverse: Your language models secretly decide how to parallelize and merge generation. arXiv preprint arXiv:2506.09991,

  24. [24]

    Population-evolve: a parallel sampling and evolutionary method for llm math reasoning

    Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, and Shuxin Zheng. Population-evolve: a parallel sampling and evolutionary method for llm math reasoning. arXiv preprint arXiv:2512.19081,

  25. [25]

    Livecodebench pro: How do olympiad medalists judge llms in competitive programming?

    Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, et al. Livecodebench pro: How do olympiad medalists judge llms in competitive programming? arXiv preprint arXiv:2506.11928,

  26. [26]

    We report two scenarios

    centered loosely on the published rating of Gemini 3.1 Pro, optimizing the posterior with scipy.optimize.minimize_scalar over the bounded interval [1000, 5000]. We report two scenarios. For gen-0 pass@1, the per-problem likelihood is Binomial with n = 20 independent gen-0 samples and k accepted; this measures the rating implied by naive sampling. For the fina...
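The rating estimate described in the fragment above can be illustrated with a short sketch of a maximum a posteriori fit under a Binomial likelihood. Several pieces are assumptions because the fragment is truncated: the logistic Elo curve linking rating to solve probability, the Gaussian form and width of the prior, and the `(problem_rating, k, n)` tuple layout are all hypothetical. The source names `scipy.optimize.minimize_scalar`; this sketch swaps in a small golden-section search so that it runs with the standard library alone.

```python
import math


def solve_prob(rating, problem_rating):
    """Assumed Elo-style logistic chance that one sample is accepted."""
    return 1.0 / (1.0 + 10.0 ** ((problem_rating - rating) / 400.0))


def neg_log_posterior(rating, problems, prior_mean, prior_sd):
    """Gaussian prior (assumed) plus Binomial log-likelihood per problem."""
    value = 0.5 * ((rating - prior_mean) / prior_sd) ** 2
    for problem_rating, k, n in problems:
        p = solve_prob(rating, problem_rating)
        value -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return value


def golden_section_min(f, lo, hi, iters=100):
    """Minimize a unimodal function on [lo, hi] (stand-in for scipy's bounded method)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def map_rating(problems, prior_mean, prior_sd=500.0):
    """MAP rating over the bounded interval [1000, 5000] quoted in the source."""
    return golden_section_min(
        lambda r: neg_log_posterior(r, problems, prior_mean, prior_sd),
        1000.0,
        5000.0,
    )


if __name__ == "__main__":
    # Illustrative counts only: (problem rating, accepted k, samples n = 20).
    data = [(1800, 17, 20), (2400, 9, 20), (3000, 2, 20)]
    print(round(map_rating(data, prior_mean=2400.0)))
```

With a logistic link and a Gaussian prior the negative log posterior is convex in the rating, so a one-dimensional bounded search of either kind finds the same optimum.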