OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

Huanzhi Mao; Jingbo Shang; Kaiyuan Liu; Qiuyang Mang; Shang Zhou; Wenhao Chai

arxiv: 2605.15177 · v2 · pith:EYH5ATMUnew · submitted 2026-05-14 · 💻 cs.AI

Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang

Pith reviewed 2026-05-20 20:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords: test-time compute scaling, LLM reasoning, Bradley-Terry model, pairwise comparison, parallel candidate sampling, code generation, population-based evolution, verifiable benchmarks

The pith

OpenDeepThink improves LLM reasoning performance by aggregating pairwise LLM judgments into Bradley-Terry rankings to select and evolve the best candidates from parallel samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenDeepThink as a way to scale test-time compute breadth by generating multiple reasoning candidates in parallel and then ranking them without any external verifier. It has the LLM compare random pairs of candidates and aggregates those votes through the Bradley-Terry model to produce a global ranking. Top-ranked traces are kept, the upper three-quarters receive natural-language mutations drawn from the comparison critiques, and the bottom quarter is dropped. Eight rounds of this process raise Gemini 3.1 Pro's effective Codeforces Elo by 405 points in roughly 27 minutes of wall-clock time. The same pipeline works on both weaker and stronger models and shows larger gains on domains with clear right-or-wrong answers.

Core claim

OpenDeepThink is a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. This raises Gemini 3.1 Pro's effective Codeforces Elo by 405 points in eight sequential LLM-call rounds.
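The selection step in the core claim can be sketched as a split of a best-first ranking into three groups. This is a minimal illustration, not the authors' code: the function name `partition_by_rank` and the `n_elite` parameter are hypothetical, and the text does not say how many top-ranked candidates are preserved, so one is assumed here.

```python
from typing import List, Sequence, Tuple


def partition_by_rank(
    ranked_ids: Sequence[int],
    mutate_frac: float = 0.75,  # "top three quarters" from the text
    n_elite: int = 1,           # assumption: the text does not give this number
) -> Tuple[List[int], List[int], List[int]]:
    """Split a best-first ranking into (preserved, mutated, discarded) ids."""
    n = len(ranked_ids)
    n_mutate = int(n * mutate_frac)          # size of the group that gets mutated
    preserved = list(ranked_ids[:n_elite])   # carried into the next round unchanged
    mutated = list(ranked_ids[:n_mutate])    # rewritten using comparison critiques
    discarded = list(ranked_ids[n_mutate:])  # bottom quarter, dropped
    return preserved, mutated, discarded


# Toy population of eight candidates, already sorted best-first.
preserved, mutated, discarded = partition_by_rank(list(range(8)))
```

With eight candidates this keeps candidate 0 unchanged, mutates candidates 0 through 5, and drops candidates 6 and 7. How the rounding behaves for population sizes not divisible by four is a design choice the summary leaves open.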

What carries the argument

Bradley-Terry aggregation of LLM pairwise comparison votes, which converts noisy head-to-head judgments into a global ranking that drives both selection and natural-language mutation of candidate reasoning traces.
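To make the aggregation step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from (winner, loser) votes with the standard minorization-maximization update. The paper's actual fitting procedure is not described in this summary, so the function name, the iteration count, and the small `prior` pseudo-count (added so that unbeaten or winless candidates keep finite strengths) are all assumptions.

```python
from typing import List, Tuple


def fit_bradley_terry(
    n_items: int,
    wins: List[Tuple[int, int]],  # (winner, loser) votes from the pairwise judge
    n_iter: int = 200,
    prior: float = 0.1,           # assumption: pseudo-count to keep strengths finite
) -> List[float]:
    """Return one positive strength per item; higher means ranked better."""
    # w[i][j] counts how often i was preferred over j (plus the pseudo-count).
    w = [[prior if i != j else 0.0 for j in range(n_items)] for i in range(n_items)]
    for winner, loser in wins:
        w[winner][loser] += 1.0

    p = [1.0] * n_items
    for _ in range(n_iter):
        new_p = []
        for i in range(n_items):
            total_wins = sum(w[i])
            denom = sum(
                (w[i][j] + w[j][i]) / (p[i] + p[j])
                for j in range(n_items)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = n_items / sum(new_p)  # fix the arbitrary overall scale
        p = [x * scale for x in new_p]
    return p


# Toy votes: 0 usually beats 1, 1 beats 2, 0 beats 2, with one upset.
votes = [(0, 1)] * 3 + [(1, 2)] * 3 + [(0, 2)] * 3 + [(1, 0)]
strengths = fit_bradley_terry(3, votes)
ranking = sorted(range(3), key=lambda i: -strengths[i])
```

Sorting by fitted strength gives the global ranking that drives selection. Unlike raw win counts, the fit accounts for which opponents each candidate happened to face, which matters when pairs are drawn at random.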

If this is right

The same Bradley-Terry selection and mutation loop transfers to other models without parameter changes.
Gains concentrate in objectively verifiable domains and can reverse in subjective ones.
Eight LLM-call rounds suffice to produce a 405-point Elo lift on Codeforces-style problems.
A new 73-problem benchmark with International Grandmaster annotations and 99 percent local-evaluation agreement is now available for further testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be stacked with depth-scaling techniques that extend single traces rather than breadth-scaling multiple ones.
If pairwise aggregation proves reliable, similar self-comparison loops might help in open-ended generation tasks where no automatic scorer exists.
The released CF-73 set offers a compact, high-agreement test bed for measuring whether any ranking method aligns with expert human judgment.

Load-bearing premise

LLM pairwise judgments, once aggregated via Bradley-Terry, produce rankings accurate enough to guide effective selection and natural-language mutation without access to ground-truth verifiers.

What would settle it

Run the full eight-round OpenDeepThink pipeline on the CF-73 problems while also scoring every final candidate against the official Codeforces verdicts; if the Bradley-Terry top-ranked traces do not solve substantially more problems than randomly chosen or pointwise-scored traces, the central claim is false.
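The comparison proposed above can be sketched as follows. The `Candidate` fields and the `solve_rates` helper are hypothetical names, and the toy data is invented purely to exercise the code; it is not a result from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    correct: bool           # verdict from the ground-truth evaluator
    bt_score: float         # Bradley-Terry strength from pairwise judging
    pointwise_score: float  # score from a single-candidate LLM judge


def solve_rates(problems: List[List[Candidate]]) -> Dict[str, float]:
    """Fraction of problems solved under three selection rules."""
    n = len(problems)
    bt = sum(max(c, key=lambda x: x.bt_score).correct for c in problems) / n
    pw = sum(max(c, key=lambda x: x.pointwise_score).correct for c in problems) / n
    # Expected solve rate when one final candidate is picked uniformly at random.
    rnd = sum(sum(x.correct for x in c) / len(c) for c in problems) / n
    return {"bt_top1": bt, "pointwise_top1": pw, "random_expected": rnd}


# Made-up toy data: two problems, two final candidates each.
toy = [
    [Candidate(True, 2.0, 0.3), Candidate(False, 1.0, 0.9)],
    [Candidate(False, 0.5, 0.2), Candidate(True, 1.5, 0.8)],
]
rates = solve_rates(toy)
```

The central claim would be in trouble if `bt_top1` were not clearly above both `pointwise_top1` and `random_expected` on the real CF-73 traces.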

Original abstract

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
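As a rough illustration of the judging step described in the abstract, the sketch below draws random pairs and records (winner, loser) votes. The `judge` callable stands in for an LLM comparison call and is entirely hypothetical, as are the function name and the number of pairs; the summary does not state how many pairs are judged per round.

```python
import random
from typing import Callable, List, Tuple


def collect_votes(
    n_candidates: int,
    n_pairs: int,
    judge: Callable[[int, int], int],  # hypothetical: returns the preferred index
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Judge random pairs and return (winner, loser) votes."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_pairs):
        # Two distinct candidates, presented in random order.
        a, b = rng.sample(range(n_candidates), 2)
        winner = judge(a, b)
        loser = b if winner == a else a
        votes.append((winner, loser))
    return votes


# Stand-in judge for illustration only: always prefers the lower index.
votes = collect_votes(n_candidates=8, n_pairs=16, judge=lambda a, b: min(a, b))
```

Because each comparison is independent of the others, the calls within a round can be issued in parallel, which is consistent with the abstract counting cost in sequential LLM-call rounds.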

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable pipeline for breadth scaling via pairwise Bradley-Terry selection and critique mutation, with reported Codeforces gains and a useful dataset release, but skips the key check on whether the rankings actually track correctness.

The paper's main idea is a population method for test-time scaling: generate multiple reasoning traces in parallel, have the LLM judge random pairs, aggregate the votes with Bradley-Terry into a ranking, keep the top candidates, and mutate the top three-quarters using the natural-language critiques that came out of the pairwise comparisons. The bottom quarter is dropped and the process repeats for a fixed number of rounds. The authors report that this lifts Gemini 3.1 Pro by 405 Elo points on Codeforces problems after eight rounds, with the gains showing up mainly on verifiable tasks and reversing on subjective ones.

The method appears to transfer across weaker and stronger models without retuning, and the authors release CF-73, a set of 73 expert-annotated problems whose local evaluation agrees with official verdicts 99 percent of the time. That dataset release is the clearest concrete output here and will be handy for anyone working on verifiable reasoning benchmarks. The specific mix of random-pair judging, Bradley-Terry aggregation, and critique-driven mutation on the retained candidates is not something I recall seeing laid out exactly this way in the earlier literature the abstract cites, so the combination counts as new even if the individual pieces are familiar. The end-to-end numbers are concrete and the wall-clock cost is modest at roughly 27 minutes.

The stress-test concern lands. The whole selection-plus-mutation loop only makes sense if the Bradley-Terry ranking puts correct answers above incorrect ones at a rate better than random choice or simple pointwise scoring. The abstract and the reported results give no direct measurement of that correlation at intermediate steps, no ablations that isolate the ranking step from the mutation operator or from raw sampling volume, and no error bars. Without those, it is hard to tell how much of the +405 Elo comes from the claimed mechanism and how much from simply running more samples and rewriting the better-looking ones.

The paper frames the problem cleanly and the empirical recipe is easy to follow, but the central performance claim rests on an untested assumption about ranking quality. The work is aimed at groups studying test-time compute scaling, especially people who want to explore breadth as a complement to longer single traces. Readers who need practical code or a new verifiable dataset will get immediate value even if they treat the ranking story as provisional. The work engages honestly with the selection bottleneck and ships reproducible artifacts, so it deserves a serious referee who can ask for the missing correlation plots and ablations. I would send it to peer review with those requests rather than desk-reject it.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces OpenDeepThink, a population-based test-time compute framework that samples multiple reasoning traces in parallel, aggregates LLM pairwise judgments via the Bradley-Terry model to produce a global ranking, preserves top-ranked candidates, mutates the top three-quarters using natural-language critiques from the comparisons, and discards the bottom quarter. It reports that this procedure raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points after eight sequential rounds on the newly released CF-73 benchmark (73 expert-rated problems with 99% local-verifier agreement) and shows transfer to other models with gains concentrated in objectively verifiable domains.

Significance. If the central performance claim is robust, the work supplies a concrete empirical demonstration that breadth scaling via pairwise ranking can improve LLM reasoning on verifiable tasks without ground-truth verifiers. The release of CF-73 and the observation that gains reverse in subjective domains are useful contributions. The absence of retuning across model strengths is a practical strength.

major comments (1)

[§4] Experiments on CF-73: The manuscript reports the end-to-end +405 Elo gain but provides no direct measurement (e.g., pass-rate or verifier-agreement curves) of whether BT-ranked candidates exhibit higher correctness than random or pointwise baselines at intermediate rounds. This correlation is load-bearing for the claim that Bradley-Terry aggregation, rather than mutation volume or sampling alone, drives the improvement.

minor comments (2)

The abstract and experimental section omit error bars, ablation tables, and the precise experimental protocol (temperature, number of pairs per round, mutation prompt template). Adding these would allow readers to assess reproducibility.
[§3] Notation for the Bradley-Terry aggregation (e.g., how vote counts are converted to scores and how the top-three-quarters cutoff is applied) should be stated explicitly with a short equation or pseudocode in §3.
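For orientation, the standard Bradley-Terry formulation that such a notation request would presumably resolve to is given here. The symbols are assumptions and not the paper's own notation: \theta_i is the latent strength of candidate i, and w_{ij} is the number of votes preferring candidate i over candidate j.

```latex
P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}},
\qquad
\hat{\theta} = \arg\max_{\theta} \sum_{i \neq j} w_{ij}
  \log \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}
```

Candidates are then sorted by \hat{\theta}_i. For a population of N candidates, one plausible reading of the cutoff is that the top \lfloor 3N/4 \rfloor are mutated and the rest discarded; the rounding rule is a guess, which is exactly the kind of detail the referee asks the authors to state.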

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the practical strengths of the framework and the utility of the CF-73 release. We address the major comment on intermediate measurements below and will incorporate the requested analyses in the revised manuscript.

Referee: [§4] Experiments on CF-73: The manuscript reports the end-to-end +405 Elo gain but provides no direct measurement (e.g., pass-rate or verifier-agreement curves) of whether BT-ranked candidates exhibit higher correctness than random or pointwise baselines at intermediate rounds. This correlation is load-bearing for the claim that Bradley-Terry aggregation, rather than mutation volume or sampling alone, drives the improvement.

Authors: We agree that direct measurements of correctness at intermediate rounds would strengthen the isolation of Bradley-Terry aggregation as the primary driver. The current manuscript focuses on the cumulative +405 Elo improvement and transfer results to establish overall efficacy. To address this, we will add to the revised §4 pass-rate curves and verifier-agreement metrics (computed post-hoc against the 99% reliable local evaluator on CF-73) comparing BT-selected candidates against random sampling and pointwise LLM scoring baselines at each of the eight rounds. These can be derived from the existing experimental traces without new model calls.

Revision: yes
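The post-hoc curves promised in the rebuttal could be computed along these lines. This is a sketch under assumptions: the `Trace` record layout and the `pass_rate_curves` name are hypothetical, every round is assumed to contain at least one rank-0 candidate, and the toy traces are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Trace(NamedTuple):
    round_idx: int  # which of the sequential rounds produced this candidate
    problem: str    # problem identifier
    rank: int       # 0 = top of the Bradley-Terry ranking in that round
    correct: bool   # post-hoc verdict from the local evaluator


def pass_rate_curves(traces: List[Trace]) -> Dict[int, Tuple[float, float]]:
    """Per round: (pass rate of BT top-1, pass rate pooled over all candidates)."""
    top = defaultdict(list)
    pooled = defaultdict(list)
    for t in traces:
        pooled[t.round_idx].append(t.correct)
        if t.rank == 0:
            top[t.round_idx].append(t.correct)
    return {
        r: (sum(top[r]) / len(top[r]), sum(pooled[r]) / len(pooled[r]))
        for r in sorted(pooled)
    }


# Made-up traces for one problem over two rounds.
toy_traces = [
    Trace(0, "A", 0, False), Trace(0, "A", 1, True),
    Trace(1, "A", 0, True), Trace(1, "A", 1, True),
]
curves = pass_rate_curves(toy_traces)
```

The pooled rate equals the expected pass rate of a uniformly random pick when every problem has the same population size. A persistent gap between the two numbers across rounds would be the evidence the referee is asking for.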

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent experimental validation

The paper presents OpenDeepThink as an empirical test-time scaling procedure: sample candidates, obtain LLM pairwise judgments, aggregate via standard Bradley-Terry model into a ranking, preserve top candidates, mutate the top three-quarters using the produced critiques, and discard the bottom quarter. Performance is measured by end-to-end Elo gains on the CF-73 benchmark, which supplies independent local verifiers with 99% agreement to official verdicts. No equations, fitted parameters, or self-citations are shown that would make the reported +405 Elo gain a tautological consequence of the method's own construction. The central mechanism (BT aggregation guiding selection and mutation) remains falsifiable against ground-truth correctness and does not reduce to renaming or self-definition.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The framework depends on the reliability of LLM pairwise judgments and the effectiveness of natural-language mutation from comparison critiques; no explicit free parameters beyond the reported 8 rounds and 3/4 mutation fraction are stated.

free parameters (2)

number of sequential rounds
Fixed at eight for the reported Gemini experiment.
mutation fraction
Top three-quarters of ranked candidates are mutated.

axioms (1)

domain assumption: LLM pairwise comparisons can be aggregated via Bradley-Terry into a global ranking that is more reliable than pointwise scoring.
Invoked to justify the selection step in the absence of ground-truth verifiers.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 14 internal anchors

[1] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. arXiv preprint arXiv:2407.21787.
[2] Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. arXiv preprint arXiv:2403.04132.
[3] Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. arXiv preprint arXiv:2110.14168.
[4] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. arXiv preprint arXiv:2507.06261.
[5] Escaping the cognitive well: Efficient competition math with off-the-shelf models
Xingyu Dang, Rohit Agarwal, Rodrigo Porto, Anirudh Goyal, Liam H Fowl, and Sanjeev Arora. arXiv preprint arXiv:2602.16793.
[6] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. arXiv preprint arXiv:2501.12948.
[7] EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers
Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. arXiv preprint arXiv:2309.08532.
[8] Reasoning with language model is planning with world model
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173.
[9] Pacore: Learning to scale test-time compute with parallel coordinated reasoning
Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, et al. arXiv preprint arXiv:2601.05593, 2026.
[10] Large Language Models Cannot Self-Correct Reasoning Yet
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. arXiv preprint arXiv:2310.01798.
[11] OpenAI o1 System Card
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. arXiv preprint arXiv:2412.16720.
[12] G-Eval: NLG evaluation using GPT-4 with better human alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522.
[13] Rethinking thinking tokens: LLMs as improvement operators
Lovish Madaan, Aniket Didolkar, Suchin Gururangan, John Quan, Ruan Silva, Ruslan Salakhutdinov, Manzil Zaheer, Sanjeev Arora, and Anirudh Goyal. arXiv preprint arXiv:2510.01123, 2025.
[14] Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang, Coleman Hooper, Yuezhou Hu, Rishabh Tiwari, Jue Wang, Harman Singh, et al. arXiv preprint arXiv:2604.07725.
[15] AlphaEvolve: A coding agent for scientific and algorithmic discovery
Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. arXiv preprint arXiv:2506.13131.
[16] Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. arXiv preprint arXiv:2501.14249.
[17] Learning to reason across parallel samples for LLM reasoning
Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. arXiv preprint arXiv:2506.09014.
[18] V1: Unifying generation and self-verification for parallel reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan, Xiaoxia Wu, Junxiong Wang, Alpay Ariyak, Qingyang Wu, Samir Khaki, et al. arXiv preprint arXiv:2603.04304.
[19] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. arXiv preprint arXiv:2408.03314.
[20] Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. arXiv preprint arXiv:2203.11171.
[21] ParaThinker: Native parallel thinking as a new paradigm to scale LLM test-time compute
Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, and Yuanchun Li. arXiv preprint arXiv:2509.04475.
[22] Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. arXiv preprint arXiv:2408.00724.
[23] Multiverse: Your language models secretly decide how to parallelize and merge generation
Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, and Beidi Chen. arXiv preprint arXiv:2506.09991.
[24] Population-evolve: a parallel sampling and evolutionary method for LLM math reasoning
Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, and Shuxin Zheng. arXiv preprint arXiv:2512.19081.
[25] LiveCodeBench Pro: How do olympiad medalists judge LLMs in competitive programming?
Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, et al. arXiv preprint arXiv:2506.11928.
[26] We report two scenarios
centered loosely on the published rating of Gemini 3.1 Pro, optimizing the posterior with scipy.optimize.minimize_scalar over the bounded interval [1000, 5000]. We report two scenarios. For gen-0 pass@1, the per-problem likelihood is Binomial with n = 20 independent gen-0 samples and k accepted; this measures the rating implied by naive sampling. For the fina...
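The rating-estimation fragment in entry [26] can be illustrated with a small maximum a posteriori fit. This is a sketch under several assumptions that the fragment does not confirm: the solve probability uses the usual Elo logistic curve against a per-problem difficulty rating, the prior is Gaussian, and the prior width, function names, and toy inputs are all hypothetical. Only the Binomial likelihood, the bounded interval [1000, 5000], and the use of `scipy.optimize.minimize_scalar` come from the text. A plain golden-section search is included as a fallback so the sketch also runs without SciPy.

```python
import math
from typing import Callable, List, Tuple

try:
    from scipy.optimize import minimize_scalar
except ImportError:  # fallback path so the sketch runs without SciPy
    minimize_scalar = None


def solve_prob(rating: float, difficulty: float) -> float:
    """Assumed Elo-style probability that `rating` solves a problem."""
    return 1.0 / (1.0 + 10.0 ** ((difficulty - rating) / 400.0))


def _golden_section(f: Callable[[float], float], lo: float, hi: float) -> float:
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > 1e-6:
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


def map_rating(
    results: List[Tuple[float, int, int]],  # (difficulty, k accepted, n samples)
    prior_mean: float,                      # e.g. a published rating
    prior_sd: float = 500.0,                # assumption: width of the loose prior
) -> float:
    def neg_log_posterior(r: float) -> float:
        nll = 0.5 * ((r - prior_mean) / prior_sd) ** 2  # Gaussian prior term
        for difficulty, k, n in results:
            p = min(max(solve_prob(r, difficulty), 1e-12), 1.0 - 1e-12)
            # Binomial log-likelihood; the constant binomial coefficient is dropped.
            nll -= k * math.log(p) + (n - k) * math.log(1.0 - p)
        return nll

    if minimize_scalar is not None:
        res = minimize_scalar(neg_log_posterior, bounds=(1000.0, 5000.0), method="bounded")
        return float(res.x)
    return _golden_section(neg_log_posterior, 1000.0, 5000.0)


# Toy input: one problem of difficulty 2000 with 10 of 20 samples accepted.
balanced = map_rating([(2000.0, 10, 20)], prior_mean=2000.0)
stronger = map_rating([(2000.0, 18, 20)], prior_mean=2000.0)
```

With half the samples accepted on a problem whose difficulty equals the prior mean, the estimate stays at that rating; a higher acceptance count pulls it upward. The objective is convex in the rating under these assumptions, so either optimizer finds the same minimum.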
