Recognition: 2 theorem links · Lean Theorem
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
Pith reviewed 2026-05-16 10:35 UTC · model grok-4.3
The pith
Sample-specific budget prediction from latent states resolves the overscaling curse in parallel LLM thinking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the overscaling curse arises because a single global sampling budget chosen to maximize dataset accuracy necessarily over-allocates paths to many individual samples whose accuracy saturates earlier; this contradiction between system efficacy and sample efficiency can be broken by a Latent Budget Predictor (LanBo) that reads latent representations to assign sample-specific budgets, thereby improving utilization without accuracy loss and supporting a Pre-decoding Budget Adaptation (PreAda) scheme that allocates budgets before decoding begins.
What carries the argument
Latent Budget Predictor (LanBo), a module that probes internal model representations to forecast the smallest number of parallel paths required for each input to reach its individual accuracy peak.
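To make the mechanism concrete, here is a minimal sketch of what such a probe could look like, assuming a PyTorch-style setup in which one hidden state taken before decoding is mapped onto a small grid of candidate budgets. The class name, layer choice, head architecture, and budget grid are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LatentBudgetProbe(nn.Module):
    """Toy stand-in for a LanBo-style predictor: maps a single hidden state
    (e.g. the last prompt token's representation at some layer) to one of a
    fixed grid of candidate budgets. All names and sizes are illustrative."""

    def __init__(self, hidden_dim: int, budget_grid=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        self.budget_grid = torch.tensor(budget_grid)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, len(budget_grid)),  # classify over candidate budgets
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_dim) latent taken before decoding starts
        logits = self.head(hidden_state)
        idx = logits.argmax(dim=-1)
        return self.budget_grid.to(hidden_state.device)[idx]  # (batch,) predicted budgets
```

Framing the prediction as classification over a fixed budget grid, rather than regression to an integer, is one plausible design choice; the paper's mention of trainable layer-wise estimators suggests per-layer probes whose outputs may be combined.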
If this is right
- Overall budget utilization rises while dataset-level accuracy stays constant.
- Pre-decoding allocation becomes possible, preserving full parallelization during the generation phase.
- Hardware metrics improve in both end-to-end latency and peak memory consumption.
- The same predictor can be dropped into existing multi-path decoding pipelines without retraining the base model.
Where Pith is reading between the lines
- Internal states appear to encode per-input reasoning difficulty or convergence rate, which could be exploited by other adaptive sampling schemes.
- The method may generalize to chain-of-thought or tree-of-thought variants where path count also trades off against accuracy.
- Production deployments could realize direct cost savings by avoiding over-sampling on the large fraction of easy inputs.
- Combining the predictor with early-stopping rules during generation might yield further efficiency gains.
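On the last point, a toy sketch of how a pre-decoded budget could be combined with an early-stopping rule during generation, assuming a self-consistency-style majority vote; `sample_answer`, the minimum path count, and the agreement threshold are hypothetical placeholders, not part of the paper.

```python
from collections import Counter

def sample_with_budget_and_early_stop(sample_answer, predicted_budget,
                                      min_paths=3, agreement=0.8):
    """Draw up to `predicted_budget` reasoning paths, but stop early once the
    leading answer holds at least `agreement` of the votes. `sample_answer`
    is any callable that runs one reasoning path and returns a final answer."""
    votes = Counter()
    for n in range(1, predicted_budget + 1):
        votes[sample_answer()] += 1
        if n >= min_paths:
            top_answer, top_count = votes.most_common(1)[0]
            if top_count / n >= agreement:
                break  # answer has stabilized; the remaining budget is saved
    return votes.most_common(1)[0][0], n
```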
Load-bearing premise
Model latent representations already contain enough information to predict the sample-specific optimal budget accurately without any extra labeled data or tuning steps that would themselves consume the budget being saved.
What would settle it
Measure the correlation between LanBo-predicted budgets and the true minimal budgets found by exhaustive per-sample search on a held-out test set; if accuracy falls when the predicted budgets replace the oracle budgets, the central claim is falsified.
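A sketch of that test, assuming oracle minimal budgets are available from an exhaustive per-sample search and that a helper can score a sample under a given budget; `predicted_budget`, `oracle_budget`, and `majority_vote_correct` are hypothetical stand-ins for the paper's actual evaluation harness.

```python
import numpy as np

def settle_it(samples, predicted_budget, oracle_budget, majority_vote_correct):
    """Compare LanBo-style predicted budgets against oracle minimal budgets.
    `predicted_budget(s)` and `oracle_budget(s)` return path counts for sample s;
    `majority_vote_correct(s, b)` runs parallel thinking with budget b and
    reports whether the aggregated answer is correct. All three are stand-ins."""
    pred = np.array([predicted_budget(s) for s in samples])
    oracle = np.array([oracle_budget(s) for s in samples])

    corr = np.corrcoef(pred, oracle)[0, 1]  # how well predicted budgets track the oracle
    acc_pred = np.mean([majority_vote_correct(s, b) for s, b in zip(samples, pred)])
    acc_oracle = np.mean([majority_vote_correct(s, b) for s, b in zip(samples, oracle)])

    # The central claim is falsified if accuracy drops (beyond noise) when the
    # predicted budgets replace the oracle budgets.
    return {"corr": corr, "acc_pred": acc_pred, "acc_oracle": acc_oracle,
            "budget_ratio": pred.sum() / oracle.sum()}
```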
Original abstract
Parallel thinking improves LLM reasoning through multi-path sampling and aggregation. In standard evaluations, due to a lack of sample-specific priors, all samples share a global budget chosen to maximize dataset accuracy. However, many samples reach their best accuracy with much smaller budgets, causing low budget utilization. This contradiction between system efficacy and sample efficiency constitutes the Overscaling Curse. In this paper, we first provide a formal analysis of the overscaling curse and quantify its prevalence and severity in real-world systems. To break it, we propose Latent Budget Predictor (LanBo), which probes model latent representations to predict sample-specific optimal budgets. LanBo significantly improves budget utilization while maintaining dataset accuracy. We further integrate LanBo into the full decoding pipeline, inspiring Pre-decoding Budget Adaptation (PreAda), a paradigm that allocates budgets before decoding to preserve decoding-time parallelization. LanBo substantially improves hardware-aware efficiency in latency and memory, demonstrating both its practical value and the promise of LanBo for efficient parallel decoding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies the 'Overscaling Curse' in parallel thinking for LLMs, where a global budget for multi-path sampling maximizes dataset accuracy but yields low utilization because many samples saturate at much smaller per-sample budgets. It proposes the Latent Budget Predictor (LanBo) that probes model latent representations to predict sample-specific optimal budgets, integrates this into Pre-decoding Budget Adaptation (PreAda) to allocate budgets before decoding, and reports gains in budget utilization, latency, and memory while preserving dataset accuracy.
Significance. If the empirical claims hold, the work offers a practical route to reconcile system-level and sample-level efficiency in parallel reasoning methods such as self-consistency. The formal analysis of the curse, the pre-decoding paradigm, and hardware-aware metrics constitute clear strengths that could influence efficient deployment of multi-path techniques.
major comments (2)
- §3.2 (LanBo training protocol): the supervision signal for optimal per-sample budgets is not specified. If labels are obtained via exhaustive per-sample sweeps over budget values, the pre-computation cost must be measured and shown to be amortized or negligible; otherwise the net efficiency gain is unclear.
- §5.1, Table 2 (utilization results): the reported utilization improvement depends on LanBo prediction accuracy, yet no error analysis (e.g., fraction of over- or under-predictions, MAE on budget) is provided. Without this, it is impossible to determine whether gains arise from latent information or from test-set characteristics.
minor comments (2)
- Figure 3 (latency/memory plots): axis labels and legend entries are too small for readability; enlarge fonts and add error bars if multiple runs were performed.
- §2.1 (notation): the symbol B* for the optimal budget is introduced without an explicit equation; add a short definition in §2.1 for clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our work. We address the major comments point by point below, and we will incorporate the suggested clarifications and additional analyses in the revised manuscript.
Point-by-point responses
- Referee: §3.2 (LanBo training protocol): the supervision signal for optimal per-sample budgets is not specified. If labels are obtained via exhaustive per-sample sweeps over budget values, the pre-computation cost must be measured and shown to be amortized or negligible; otherwise the net efficiency gain is unclear.
  Authors: We appreciate this observation. The supervision signal for LanBo is derived from per-sample sweeps on a validation set to determine the smallest budget achieving maximum accuracy for each sample. We acknowledge that the pre-computation cost was not explicitly quantified in the original submission. In the revision, we will report the time required for these sweeps and demonstrate that it is amortized over repeated use of the model on similar data distributions, leading to net efficiency gains. We will also discuss how this cost compares to the savings in inference time. Revision: yes.
- Referee: §5.1, Table 2 (utilization results): the reported utilization improvement depends on LanBo prediction accuracy, yet no error analysis (e.g., fraction of over- or under-predictions, MAE on budget) is provided. Without this, it is impossible to determine whether gains arise from latent information or from test-set characteristics.
  Authors: We agree that providing an error analysis is important to substantiate the source of the gains. In the revised manuscript, we will add an analysis including the Mean Absolute Error (MAE) between predicted and optimal budgets, the percentages of over-predictions and under-predictions, and an ablation study comparing LanBo to random budget assignment. This will clarify that the improvements are due to the predictive power of the latent representations rather than inherent properties of the test set (see the sketch below). Revision: yes.
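A minimal sketch of the label sweep and error analysis discussed in this exchange, under the assumption that the supervision target is the smallest budget reaching a sample's peak accuracy; `accuracy_at` and the budget grid are hypothetical placeholders, not the paper's protocol.

```python
import numpy as np

def oracle_minimal_budget(accuracy_at, budgets=(1, 2, 4, 8, 16, 32), tol=1e-9):
    """Smallest budget whose accuracy matches the sample's best accuracy over the
    sweep. `accuracy_at(b)` is a placeholder for running parallel thinking with
    budget b on one sample (e.g. averaged over repeated draws)."""
    accs = {b: accuracy_at(b) for b in budgets}
    best = max(accs.values())
    return min(b for b, a in accs.items() if a >= best - tol)

def budget_error_report(predicted, oracle):
    """MAE and over-/under-prediction fractions between predicted and oracle budgets."""
    predicted, oracle = np.asarray(predicted), np.asarray(oracle)
    return {
        "mae": float(np.mean(np.abs(predicted - oracle))),
        "over_rate": float(np.mean(predicted > oracle)),
        "under_rate": float(np.mean(predicted < oracle)),
    }
```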
Circularity Check
No circularity: LanBo prediction from latents is independent of the accuracy metric it targets
full rationale
The paper's core claim is a formal analysis of the overscaling curse followed by a proposal to predict per-sample budgets from existing model latent representations. No equations in the provided text define the predictor output in terms of the accuracy or utilization it is meant to improve, nor do any steps reduce a 'prediction' to a fitted input by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming is smuggled through citations. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (2)
- Latent Budget Predictor (LanBo): no independent evidence
- Pre-decoding Budget Adaptation (PreAda): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: LanBo probes model latent representations to predict sample-specific optimal budgets... trainable layer-wise estimators $\phi_{\theta_\ell}(h^{(\ell)}_T(x))$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: Overscaling Index $M_D = N^*_D / N_D$; five sample types based on the monotonicity of $A_x(N)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.