pith. machine review for the scientific record.

arxiv: 2604.05868 · v1 · submitted 2026-04-07 · 💻 cs.CL

Recognition: 2 Lean theorem links

Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords large reasoning models · parallel sampling · sequential sampling · performance gap · exploration · math reasoning · coding tasks · sampling strategies

The pith

The performance gap between parallel and sequential sampling in large reasoning models is primarily due to reduced exploration in sequential approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models often need multiple samples to solve hard math and coding problems. Parallel sampling generates independent answers and outperforms sequential sampling, where each new sample builds on prior ones, even though sequential sampling has greater representational power. The paper tests three explanations: the aggregator used to combine answers, the longer context required in sequential mode, and reduced exploration caused by conditioning on previous answers. Experiments across model families show that only the lack of exploration consistently explains the gap. This matters because it identifies a concrete limit on how chained sampling uses prior attempts to improve solutions.
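
As a concrete reading of the two regimes, here is a minimal Python sketch of the contrast the paper draws. The `generate` callable stands in for any LRM call (prompt in, answer out); it is a placeholder for illustration, not an interface from the paper.

```python
# Minimal sketch of the two sampling regimes, assuming only a `generate`
# callable (prompt -> answer). Nothing here is an API from the paper.
from typing import Callable, List

def parallel_sample(generate: Callable[[str], str], question: str, n: int) -> List[str]:
    # n independent draws; no sample conditions on any other.
    return [generate(question) for _ in range(n)]

def sequential_sample(generate: Callable[[str], str], question: str, n: int) -> List[str]:
    # Each draw conditions on all previous answers, as in chained refinement.
    answers: List[str] = []
    for _ in range(n):
        history = "\n".join(f"Previous attempt: {a}" for a in answers)
        answers.append(generate(f"{question}\n{history}".strip()))
    return answers
```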

Core claim

Parallel sampling outperforms sequential sampling in large reasoning models on math and coding tasks despite sequential sampling's greater power. Controlled experiments on Qwen3, DeepSeek-R1 distilled models, and Gemini 2.5 isolate the effects of the aggregator operator, context length, and conditioning on prior answers. The results indicate that aggregation and context length are not the main drivers, while conditioning leads to less exploration and accounts for most of the observed performance difference.

What carries the argument

The hypothesis that sequential sampling reduces exploration by conditioning each new sample on previous answers, isolated through targeted comparisons against aggregator and context-length effects.

If this is right

  • Sequential sampling performance can be improved by introducing mechanisms that maintain answer diversity across steps (a sketch follows this list).
  • Parallel sampling remains the more reliable strategy for maximizing solution quality on challenging reasoning problems.
  • The effective search space in sequential inference is narrower than the model's capacity would suggest because of conditioning.
  • Inference pipelines for large reasoning models should prioritize independent sampling paths when exploration is the bottleneck.
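
A minimal sketch of the diversity mechanism the first bullet points at: regenerate when a sequential draw overlaps too heavily with earlier answers. The token-level Jaccard test, the 0.8 threshold, and the retry budget are illustrative assumptions, not choices from the paper.

```python
# Illustrative diversity filter for a sequential chain. The Jaccard test,
# max_overlap=0.8, and retries=3 are assumptions, not values from the paper.
from typing import Callable, List

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def diverse_sequential_sample(generate: Callable[[str], str], question: str,
                              n: int, max_overlap: float = 0.8,
                              retries: int = 3) -> List[str]:
    answers: List[str] = []
    for _ in range(n):
        history = "\n".join(f"Previous attempt: {a}" for a in answers)
        candidate = generate(f"{question}\n{history}".strip())
        for _ in range(retries):
            if all(jaccard(candidate, a) < max_overlap for a in answers):
                break  # sufficiently different from every prior answer
            candidate = generate(
                f"{question}\n{history}\nGive a substantially different approach.")
        answers.append(candidate)
    return answers
```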

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This points to a possible need for training methods that encourage diverse reasoning trajectories usable in sequential chains.
  • Hybrid sampling strategies that begin with parallel draws before chaining could combine the strengths of both approaches (see the sketch after this list).
  • The finding may extend to other multi-step reasoning settings where prior outputs risk narrowing the search space.
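
A sketch of the hybrid strategy from the second bullet, under the assumption that independent draws supply the diversity and the chain supplies refinement; the paper does not define this pipeline.

```python
# Hypothetical hybrid: independent parallel seeds first, then a chain that
# conditions on the whole diverse pool. Purely illustrative.
from typing import Callable, List

def hybrid_sample(generate: Callable[[str], str], question: str,
                  n_parallel: int, n_chain: int) -> List[str]:
    # Phase 1: independent draws supply diversity.
    answers = [generate(question) for _ in range(n_parallel)]
    # Phase 2: sequential refinement conditioned on the diverse pool.
    for _ in range(n_chain):
        history = "\n".join(f"Attempt: {a}" for a in answers)
        answers.append(generate(
            f"{question}\n{history}\nImprove on the attempts above."))
    return answers
```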

Load-bearing premise

The chosen empirical tests on Qwen3, DeepSeek-R1 distilled models, and Gemini 2.5 across math and coding domains sufficiently isolate lack of exploration from confounding factors such as prompt formatting or aggregation details.

What would settle it

An experiment that increases exploration in sequential sampling, for example by raising temperature or adding diversity penalties while holding context length and aggregator fixed, and measures whether the performance gap with parallel sampling closes.
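
A sketch of how that settling experiment might be wired up: sweep temperature in both arms with a fixed best-of-n aggregator and identical sample counts. Here `generate(prompt, t)` and `is_correct(answer)` are hypothetical stand-ins for a model call and a grader.

```python
# Settling experiment sketch: vary only exploration (temperature), hold the
# aggregator fixed, and compare the parallel and sequential arms.
from typing import Callable, Dict, List, Tuple

def best_of(answers: List[str], is_correct: Callable[[str], bool]) -> bool:
    # Fixed aggregator: the attempt succeeds if any sample is correct.
    return any(is_correct(a) for a in answers)

def gap_vs_temperature(generate: Callable[[str, float], str],
                       is_correct: Callable[[str], bool],
                       questions: List[str], n: int,
                       temperatures: List[float]) -> Dict[float, Tuple[float, float]]:
    results: Dict[float, Tuple[float, float]] = {}
    for t in temperatures:
        par_hits = seq_hits = 0
        for q in questions:
            par = [generate(q, t) for _ in range(n)]
            seq: List[str] = []
            for _ in range(n):
                history = "\n".join(f"Previous attempt: {a}" for a in seq)
                seq.append(generate(f"{q}\n{history}".strip(), t))
            par_hits += best_of(par, is_correct)
            seq_hits += best_of(seq, is_correct)
        results[t] = (par_hits / len(questions), seq_hits / len(questions))
    return results
```

If the sequential curve approaches the parallel one as temperature rises, the exploration account gains support; if the gap persists, some other factor is in play.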

Figures

Figures reproduced from arXiv: 2604.05868 by Larisa Markeeva, Petar Veličković, Razvan Pascanu, Soham De, Xiangming Gu.

Figure 1: Illustration of sampling approaches considered in this work. In parallel sampling, …
Figure 2: Comparisons of AIME2025 performance between parallel and sequential sampling …
Figure 3: Comparisons of LiveCodeBench v5 performance between parallel and sequential …
Figure 4: Comparisons of AIME2025 performance between parallel and sequential sampling …
Figure 5: Comparisons of LiveCodeBench v5 performance of Gemini 2.5 Flash/DeepSeek …
Figure 6: (Left) input context length, (Middle) overall context length, (Right) thinking traces length when using Gemini 2.5 Flash on LiveCodeBench v5 under parallel and sequential sampling. Even though there is a clear performance gap between Markov sequential sampling and parallel sampling, the input context lengths are similar. In addition, the overall context length of auto-regressive sampling is not significan…
Figure 7: An example (with content truncation) of two solutions in an auto-regressive …
Figure 9: Comparisons of LiveCodeBench v5 performance of Gemini 2.5 Flash/DeepSeek-R1- …
Figure 8: Performance of Gemini 2.5 Flash on LiveCodeBench v5 using parallel sampling with different rounds of previous samples in input context. Here “Round N” refers to N previous solutions in a sequence chain. Best-of-N aggregation with both public and private tests for rewarding is applied here.
Figure 10: Visualization of induction heads in Qwen3-14B using auto-regressive sequential …
Figure 11: Visualization of induction heads in Qwen3-14B using auto-regressive sequential …
Figure 12: Comparisons of AIME2025 performance between parallel and sequential …
Figure 13: Comparisons of AIME2025 performance between parallel and sequential …
Figure 14: Code generation prompt of Gemini 2.5 for LiveCodeBench.
Figure 15: Code generation prompt of DeepSeek-R1-Distill Qwen models for LiveCodeBench …
Figure 16: Code generation prompt of Gemini 2.5 for LiveCodeBench in Markov sequential …
Figure 17: Feedback prompt design when there are running errors in sequential sampling.
Figure 18: Prompt design for generating test cases based on public tests.
Figure 19: Comparisons of LiveCodeBench v5 performance between parallel and sequential …
Figure 20: Comparisons of LiveCodeBench v5 performance between parallel and sequential …
Figure 21: Comparisons of LiveCodeBench v5 performance between parallel and sequential …
Figure 22: Comparisons of AIME2025 performance between parallel and sequential …
Figure 23: Comparisons (with different difficulty levels) of LiveCodeBench v5 performance …
Figure 24: Comparisons (with different difficulty levels) of LiveCodeBench v5 performance …
Figure 25: Comparisons (with different difficulty levels) of LiveCodeBench v5 performance …
Figure 26: Comparisons (with different difficulty levels) of LiveCodeBench v5 performance …
Figure 27: (Left) input context length, (Middle) overall context length, (Right) thinking traces length when using Gemini 2.5 Flash on AIME2025 under parallel and sequential sampling.
Figure 28: (Left) input context length, (Middle) overall context length, (Right) thinking traces length when using Qwen3-14B on AIME2025 under parallel and sequential sampling.
Figure 29: (Left) input context length, (Middle) overall context length, (Right) thinking traces length when using DeepSeek-R1-Distill Qwen-14B on AIME2025 under parallel and sequential sampling.
Figure 30: (Left) input context length, (Middle) overall context length, (Right) thinking traces length when using Gemini 2.5 Pro on LiveCodeBench v5 under parallel/sequential sampling.
Figure 31: Comparisons (with different difficulty levels) of LiveCodeBench v5 performance …
Figure 32: Solution generated by Gemini 2.5 Pro repeatedly appear in the auto-regressive …
Figure 33: (Continued) solution generated by Gemini 2.5 Pro repeatedly appear in the …
Figure 34: (Continued) solution generated by Gemini 2.5 Pro repeatedly appear in the …
read the original abstract

Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high-quality solution, one may need to sample more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches with rigor, and observe, aligned with previous works, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underlying reasons, we make three hypotheses about this behavior: (i) parallel sampling outperforms due to the aggregator operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that the aggregation and context length do not seem to be the main culprit behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that this is one main cause for the performance gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper compares parallel and sequential sampling in Large Reasoning Models (LRMs) on math and coding tasks. It observes that parallel sampling outperforms sequential sampling despite the latter's greater representational power. The authors test three hypotheses for the gap—(i) aggregator operator differences, (ii) harm from longer contexts in sequential sampling, and (iii) reduced exploration from conditioning on prior answers—via ablations on Qwen3, DeepSeek-R1 distilled models, and Gemini 2.5. They conclude that lack of exploration is the primary cause.

Significance. If the central claim holds, the work would offer useful guidance on inference strategies for LRMs by emphasizing the value of maintaining answer diversity. The multi-model and multi-domain empirical evaluation is a strength that broadens the applicability of the observations.

major comments (2)
  1. [Abstract] The isolation of lack of exploration as the main cause (hypothesis iii) rests on ablations showing that aggregator choice and context length do not close the gap. However, these ablations are described without quantitative details, error bars, or explicit metrics on performance differences, leaving the elimination argument weakly supported.
  2. [Empirical evidence sections] No direct, controlled measurement of exploration (e.g., entropy of answer distributions, fraction of unique correct solutions, or trajectory diversity) is reported while holding prompt template, temperature, and sample count fixed and disabling conditioning in the sequential arm. This makes the central claim rest on indirect evidence by elimination, allowing confounds such as compounding early errors to remain unruled out.
minor comments (2)
  1. [Abstract] The abstract refers to 'various model families and sizes' and 'question domains' but does not list the precise model sizes, number of questions per domain, or sampling parameters used in the comparisons.
  2. [Throughout] Consider including statistical significance tests or variance estimates alongside the reported performance gaps to allow readers to assess the reliability of the observed differences across models.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions where they strengthen the presentation of our results without altering the core findings.

read point-by-point responses
  1. Referee: [Abstract] The isolation of lack of exploration as the main cause (hypothesis iii) rests on ablations showing that aggregator choice and context length do not close the gap. However, these ablations are described without quantitative details, error bars, or explicit metrics on performance differences, leaving the elimination argument weakly supported as noted in the abstract's summary of findings.

    Authors: We agree that the ablation results would be more convincing with explicit quantitative support. In the revised manuscript we have expanded the relevant sections and tables to report exact performance deltas (including means and standard deviations across repeated runs) for each hypothesis test on Qwen3, DeepSeek-R1, and Gemini 2.5. These additions make the elimination argument concrete and show that neither aggregator choice nor context length accounts for the observed gap. revision: yes

  2. Referee: [Empirical evidence sections] No direct, controlled measurement of exploration (e.g., entropy of answer distributions, fraction of unique correct solutions, or trajectory diversity) is reported while holding prompt template, temperature, and sample count fixed and disabling conditioning in the sequential arm. This makes the central claim rest on indirect evidence by elimination, allowing confounds such as compounding early errors to remain unruled out.

    Authors: We acknowledge the value of direct exploration metrics. Our experimental design already holds prompt template, temperature, and sample count fixed; the only systematic difference between arms is the conditioning step inherent to sequential sampling. In the revision we have added direct measurements of answer diversity (fraction of unique correct solutions) and entropy of the generated answer distributions under these controlled conditions. These metrics confirm substantially lower exploration in the sequential case. On the potential confound of compounding early errors, the context-length ablation already evaluates performance when long histories are supplied, and the gap remains; this indicates that reduced exploration, rather than error accumulation alone, is the dominant factor. revision: partial
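
For concreteness, the two metrics the response cites could be computed roughly as below. Treating exact string match as the notion of answer identity is an assumption made for illustration; the paper may canonicalize answers differently.

```python
# Rough versions of the two diversity metrics named in the rebuttal: the
# fraction of unique correct solutions, and the entropy of the empirical
# answer distribution. `is_correct` is a placeholder grader.
import math
from collections import Counter
from typing import Callable, List

def unique_correct_fraction(answers: List[str],
                            is_correct: Callable[[str], bool]) -> float:
    unique_correct = {a for a in answers if is_correct(a)}
    return len(unique_correct) / max(len(answers), 1)

def answer_entropy(answers: List[str]) -> float:
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```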

Circularity Check

0 steps flagged

No significant circularity: purely empirical hypothesis testing

full rationale

The paper conducts an empirical study comparing parallel and sequential sampling in LRMs. It states three hypotheses about the performance gap (aggregator choice, context length, and reduced exploration) and evaluates them via experiments across Qwen3, DeepSeek-R1, and Gemini 2.5 on math/coding tasks. Conclusions are drawn from observed performance differences after ablating specific factors. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The argument relies on direct experimental comparisons rather than any self-referential reduction, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions in LLM evaluation and sampling studies; no free parameters, new axioms, or invented entities are introduced.

axioms (1)
  • domain assumption: Performance differences between sampling strategies can be measured reliably via accuracy on math and coding benchmarks.
    Used to compare parallel and sequential approaches across models.

pith-pipeline@v0.9.0 · 5535 in / 1116 out tokens · 27733 ms · 2026-05-10T18:31:18.178923+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
