Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

Dongmei Zhang; Fangkai Yang; Furu Wei; Lu Wang; Pu Zhao; Qibin Wang; Qingwei Lin; Saravan Rajmohan; Shaohan Huang

arXiv: 2509.00084 · v2 · submitted 2025-08-27 · 💻 cs.LG · cs.AI · cs.CL

Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, Furu Wei, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Pith reviewed 2026-05-18 21:12 UTC · model grok-4.3

keywords: self-refinement, parallel reasoning, test-time scaling, LLM reasoning, mathematical benchmarks, model transfer, generative refinement, majority voting

The pith

A single model can learn to both generate parallel reasoning candidates and refine them into a better answer by transferring that skill from larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the benefit of self-refinement over simple majority voting among multiple reasoning attempts grows steadily as models become larger, yet depends little on the model's base accuracy. By defining the Refinement Gap as the extra gain from refinement, the authors show that this quantity supports a transfer process in which larger models teach smaller ones how to improve on a set of candidates. They introduce Generative Self-Refinement, which trains one model jointly on candidate generation and on producing a refined answer conditioned on those candidates. Experiments then demonstrate that this approach surpasses other parallel aggregation techniques on five mathematical benchmarks and that the learned refinement ability carries over to different model families and to new problem distributions. A sympathetic reader would care because many current test-time methods waste compute when every candidate is wrong; teaching the model to fix the collective output offers a way to keep improving without needing any single path to be correct on its own.

Core claim

The central claim is that the Refinement Gap, which measures how much self-refinement improves upon majority voting, follows a clear scaling trend with model size while correlating only weakly with base capability. This separation allows the refinement policy to be transferred from larger teacher models into smaller student models through joint training of a single model that both produces strong parallel candidates and refines a superior final answer from them.
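The summary describes the Refinement Gap only in words. One plausible formalization, offered here as an assumption rather than as the paper's definition, writes it as the accuracy difference between refinement and majority voting over the same N candidates, with an optional relative variant. The symbols RG, Acc_ref, and Acc_maj are illustrative names, not notation taken from the paper.

```latex
% Assumed notation; the paper's exact definition may differ.
% Acc_ref(M, N): accuracy of model M when refining over its own N candidates.
% Acc_maj(M, N): accuracy of majority voting over the same N candidates.
\[
  \mathrm{RG}(M, N) = \mathrm{Acc}_{\mathrm{ref}}(M, N) - \mathrm{Acc}_{\mathrm{maj}}(M, N),
  \qquad
  \mathrm{RG}_{\mathrm{rel}}(M, N) = \frac{\mathrm{RG}(M, N)}{\mathrm{Acc}_{\mathrm{maj}}(M, N)}.
\]
```

Under this reading, a positive RG means refinement recovers answers that voting over the same candidates misses.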

What carries the argument

Generative Self-Refinement (GSR), a parallel test-time scaling method that jointly trains one model to generate candidates and to synthesize a refined answer conditioned on those candidates by distilling the refinement behavior observed in larger models that exhibit higher Refinement Gap.
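The two-stage inference pattern described here can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for any LLM sampling call, the names `build_refine_prompt` and `gsr_answer` are invented for illustration, and the prompt wording is made up (the paper's own template is in its Figure 4).

```python
from typing import Callable, List

# Hypothetical signature: takes a prompt, returns one sampled completion.
Generate = Callable[[str], str]


def build_refine_prompt(problem: str, candidates: List[str]) -> str:
    """Assemble a refinement prompt from the problem and its candidates.

    The wording is illustrative only; it is not the paper's template.
    """
    numbered = "\n\n".join(
        f"Candidate {i}:\n{c}" for i, c in enumerate(candidates, start=1)
    )
    return (
        f"Problem:\n{problem}\n\n"
        f"{numbered}\n\n"
        "The candidates above may all be wrong. Review them and write "
        "a single, corrected final solution."
    )


def gsr_answer(problem: str, generate: Generate, n_candidates: int = 4) -> str:
    """Stage 1: sample candidates. Stage 2: refine them with the same model."""
    candidates = [generate(problem) for _ in range(n_candidates)]
    return generate(build_refine_prompt(problem, candidates))
```

The same callable is used for both stages because the source describes a single model that generates and refines; a teacher/student split would pass two different callables instead.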

If this is right

The method reaches state-of-the-art results across five mathematical benchmarks compared with other parallel aggregation approaches.
The learned refinement skill transfers across multiple model scales and families.
The approach shows robust generalization when tested on an out-of-distribution domain.
Joint training lets the model improve both the quality of its candidate solutions and its ability to refine them.
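Joint training of this kind implies a supervised dataset that mixes two example types: direct solving and refinement over candidates. A minimal sketch of assembling such a mix is below. The `Record` fields, the prompt layout, and the one-to-one mixing ratio are all assumptions made for illustration; the source does not specify them.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One problem with outputs assumed to have been collected beforehand."""
    problem: str
    direct_solution: str        # target for the "solve" task
    candidates: List[str]       # parallel candidate solutions
    teacher_refinement: str     # target for the "refine" task


def to_sft_examples(rec: Record) -> List[dict]:
    """Turn one record into a solve example and a refine example."""
    shown = "\n\n".join(
        f"Candidate {i}:\n{c}" for i, c in enumerate(rec.candidates, start=1)
    )
    return [
        {"task": "solve", "prompt": rec.problem,
         "target": rec.direct_solution},
        {"task": "refine", "prompt": f"{rec.problem}\n\n{shown}",
         "target": rec.teacher_refinement},
    ]


def build_hybrid_dataset(records: List[Record], seed: int = 0) -> List[dict]:
    """Flatten all records and shuffle so both tasks are interleaved."""
    examples = [ex for rec in records for ex in to_sft_examples(rec)]
    random.Random(seed).shuffle(examples)
    return examples
```

Shuffling with a fixed seed keeps the mix reproducible while preventing the model from seeing all examples of one task before the other.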

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Smaller models could close part of the performance gap with much larger models on reasoning tasks by acquiring this refinement capability without adding parameters.
The Refinement Gap could serve as a diagnostic to decide in advance which models are likely to benefit from self-refinement at test time.
The same joint-training pattern might be applied to reasoning domains beyond mathematics, such as code generation or scientific problem solving.

Load-bearing premise

The Refinement Gap must increase reliably with model size while staying only weakly tied to the model's basic accuracy, so that the refinement policy can be successfully transferred to smaller models through joint training.

What would settle it

Run GSR on a small student model and observe whether its final accuracy on the five mathematical benchmarks fails to exceed majority voting or drops below the performance of the unrefined student.

Figures

Figures reproduced from arXiv: 2509.00084 by Dongmei Zhang, Fangkai Yang, Furu Wei, Lu Wang, Pu Zhao, Qibin Wang, Qingwei Lin, Saravan Rajmohan, Shaohan Huang.

**Figure 2.** An overview of our method. We introduce Generative Self-Refinement (GSR), a framework that generates a superior final answer by selectively leveraging insights from multiple parallel candidate solutions generated by itself. As illustrated in...

**Figure 3.** An overview of the hybrid training pipeline, which consists of a data generation stage followed by a supervised fine...

**Figure 4.** Prompt template for our method. Full prompt tem...

**Figure 5.** Accuracy of input scaling on the performance...

**Figure 6.** Experiments results for our model and its base on a...

**Figure 7.** Average token counts of model outputs across five benchmarks, showing the breakdown between the Thinking (solid)...

Original abstract

Test-time scaling (TTS) has gained widespread attention for enhancing LLM reasoning. Existing approaches such as Best-of-N and majority voting are limited as their performance depends on the quality of candidate responses, making them unable to produce a correct solution when all candidates are incorrect. Parallel self-refinement, generating multiple candidates and synthesizing a refined answer conditioned on them, offers a promising alternative, but the underlying mechanism driving its effectiveness remains obscure. To bridge this gap in understanding, we introduce a new metric, the Refinement Gap, designed to quantify the relative improvement of self-refinement beyond majority voting. We show that the Refinement Gap exhibits a clear scaling trend with model size and is only weakly correlated with the base capability. Based on this discovery, we propose Generative Self-Refinement (GSR), a parallel test-time scaling framework that transfers the refinement policy from larger teacher models with higher refinement gap into smaller students. Crucially, GSR jointly trains a single model to generate strong candidates and refine a better final answer based on these candidates. Experimental results demonstrate that our method achieves state-of-the-art performance across five mathematical benchmarks over other parallel aggregation methods, while the learned refinement skill transfers across multiple model scales and families and exhibits robust generalization to an out-of-distribution domain.
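The failure mode the abstract attributes to majority voting can be made concrete with a toy example. The sketch below assumes exact-match scoring and treats the gap as refined accuracy minus majority-vote accuracy, which is one reading of "relative improvement" and not necessarily the paper's formula; the refined answers are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (first seen wins ties)."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def refinement_gap(refined: List[str], candidate_sets: List[List[str]],
                   gold: List[str]) -> float:
    """Assumed form: refined accuracy minus majority-vote accuracy."""
    voted = [majority_vote(c) for c in candidate_sets]
    return accuracy(refined, gold) - accuracy(voted, gold)


# Second problem: every candidate is wrong, so no vote can be right.
candidate_sets = [["4", "4", "5"], ["7", "8", "9"]]
gold = ["4", "10"]
refined = ["4", "10"]  # hypothetical refiner output
print(refinement_gap(refined, candidate_sets, gold))  # 0.5
```

Voting scores 0.5 here because it can only select among existing candidates, while a generative refiner is free to write an answer that none of them contain.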

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GSR adds a Refinement Gap metric and joint training for parallel self-refinement, but the transfer claim needs ablations to separate refinement skill from better candidate generation.

The main thing to know is that the paper defines a Refinement Gap to measure the extra lift from self-refinement over simple majority voting in parallel test-time scaling, finds that this gap grows with model size while staying only loosely tied to raw capability, and then builds a joint training setup called GSR to move the refinement ability from large models to small ones.

What the authors do well is spot a real limitation in existing parallel methods such as Best-of-N or voting: these methods fail completely if every candidate is wrong. The joint training of candidate generation and refinement in one model is a practical step, and the authors claim it delivers state-of-the-art numbers on five math benchmarks while showing the skill carries over to other model families and even an out-of-distribution setting.

The soft spot is that the abstract gives no experimental details, no ablation tables, and no statistical checks. The stress-test concern lands because the method trains everything together, so any improvement could come simply from the model learning to produce better initial candidates rather than from distilling a separate refinement policy. Without a clean comparison that holds candidate quality fixed and varies only the refinement head, or a pure distillation baseline, it is difficult to credit the scaling observation with enabling the cross-scale transfer. The weak correlation claim also needs the actual data to judge how weak it really is.

This paper is for groups already working on test-time scaling and on ways to make smaller LLMs more reliable on reasoning tasks. A reader who wants concrete ideas for improving parallel aggregation would find the metric and the joint objective useful to try. It has enough of a new angle and a clear empirical hook that it deserves a serious referee, mainly to verify the experiments and push for the missing controls. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Refinement Gap metric to quantify the improvement of parallel self-refinement over majority voting in LLM reasoning. It reports that this gap scales with model size while correlating only weakly with base capability. Building on this, the authors propose Generative Self-Refinement (GSR), a framework that jointly trains a model to generate candidate solutions and refine a final answer from them, transferring the refinement policy from larger teacher models to smaller student models. The central empirical claims are state-of-the-art results on five mathematical benchmarks relative to other parallel aggregation methods, successful cross-scale and cross-family transfer of the refinement skill, and robust generalization to an out-of-distribution domain.

Significance. If the scaling observation and transfer results hold after proper controls, the work offers a concrete mechanism for improving test-time scaling on smaller models by distilling refinement behavior from larger ones, moving beyond simple aggregation methods like Best-of-N or voting. The introduction of the Refinement Gap as a diagnostic tool and the joint-training formulation are potentially useful contributions to understanding and engineering parallel reasoning in LLMs.

major comments (2)

[Method and Experiments] The justification for GSR rests on the claim that the Refinement Gap enables cross-scale policy transfer because it scales with size yet is only weakly correlated with base capability. However, because GSR jointly optimizes candidate generation and refinement within a single model, the reported gains on the five benchmarks could arise from improved candidate quality under the joint objective rather than from distilling a size-dependent refinement skill. An ablation that isolates the refinement head (e.g., freezing candidate generation or comparing against pure distillation of teacher refinements) is needed to establish that the observed transfer is attributable to the gap rather than joint training effects.
[Abstract and Experimental Results] The abstract and experimental claims assert SOTA performance and robust transfer/generalization, yet the provided text supplies no details on the exact baselines, number of candidates, statistical tests, variance across runs, or ablation tables that would allow evaluation of whether the gains are load-bearing for the transfer hypothesis.

minor comments (2)

[Introduction] Notation for the Refinement Gap should be defined with an explicit formula early in the paper rather than introduced descriptively.
[Figures and Tables] Figure captions and table headers would benefit from explicit statements of the number of samples, seeds, and confidence intervals used.
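The reporting the referee asks for (seeds, variance, confidence intervals) takes only a few lines to produce. The sketch below is a generic summary over per-seed accuracies; the function name and the normal-approximation interval are choices made here, not something the paper specifies.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(
    accuracies: Sequence[float], z: float = 1.96
) -> Tuple[float, float, Tuple[float, float]]:
    """Return mean, sample std, and a normal-approximation 95% interval."""
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)      # sample standard deviation
    half_width = z * std / math.sqrt(n)     # z times the standard error
    return mean, std, (mean - half_width, mean + half_width)
```

With only a handful of seeds the normal approximation is optimistic; a Student's t critical value in place of `z` gives a wider and more honest interval.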

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the contributions of the Refinement Gap and Generative Self-Refinement (GSR). We respond to each major comment below and outline revisions to strengthen the empirical support for our claims about refinement skill transfer.

Referee: [Method and Experiments] The justification for GSR rests on the claim that the Refinement Gap enables cross-scale policy transfer because it scales with size yet is only weakly correlated with base capability. However, because GSR jointly optimizes candidate generation and refinement within a single model, the reported gains on the five benchmarks could arise from improved candidate quality under the joint objective rather than from distilling a size-dependent refinement skill. An ablation that isolates the refinement head (e.g., freezing candidate generation or comparing against pure distillation of teacher refinements) is needed to establish that the observed transfer is attributable to the gap rather than joint training effects.

Authors: We agree this is a valid concern and that joint optimization could improve candidate quality as a side effect. Our transfer results across scales and model families, combined with the Refinement Gap's scaling behavior and weak correlation to base accuracy, provide initial evidence that the refinement policy is the transferable component. However, to more rigorously isolate this, we will add two new ablations in the revised version: (1) a controlled experiment freezing the candidate-generation parameters while fine-tuning only the refinement component on teacher-generated candidates, and (2) a direct comparison against pure distillation of teacher refinement outputs without the joint objective. These will be reported alongside the existing cross-scale and cross-family transfer tables.

Revision: yes

Referee: [Abstract and Experimental Results] The abstract and experimental claims assert SOTA performance and robust transfer/generalization, yet the provided text supplies no details on the exact baselines, number of candidates, statistical tests, variance across runs, or ablation tables that would allow evaluation of whether the gains are load-bearing for the transfer hypothesis.

Authors: We appreciate the feedback on presentation. The full manuscript (Section 4 and Appendix) specifies the baselines (majority voting, Best-of-N, and other parallel aggregation methods), uses 8–16 candidates per problem, reports means and standard deviations over multiple runs, and includes ablation tables on transfer. Statistical comparisons are provided where differences are discussed. To address the concern directly, we will revise the abstract to include concise references to these experimental settings and add a short paragraph in the main text summarizing variance and controls. This will make the load-bearing nature of the transfer results clearer without altering the core claims.

Revision: partial
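The control at issue in the first exchange, holding candidates fixed while varying only the refiner, can be expressed as a small evaluation harness. This is a sketch under assumptions: `Refiner` is a hypothetical callable type, `compare_refiners` is an invented name, and scoring is exact match.

```python
from typing import Callable, Dict, List

# Hypothetical refiner: maps (problem, candidates) to a final answer.
Refiner = Callable[[str, List[str]], str]


def compare_refiners(
    problems: List[str],
    frozen_candidates: List[List[str]],
    gold: List[str],
    refiners: Dict[str, Refiner],
) -> Dict[str, float]:
    """Score each refiner on the same frozen candidates (exact match)."""
    scores = {}
    for name, refine in refiners.items():
        correct = sum(
            refine(p, c) == g
            for p, c, g in zip(problems, frozen_candidates, gold)
        )
        scores[name] = correct / len(gold)
    return scores
```

Because every refiner sees identical candidates, any difference in score cannot be explained by candidate quality, which is the confound the referee raises.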

Circularity Check

0 steps flagged

No circularity: empirical scaling observation and standard transfer learning

The paper introduces the Refinement Gap as an empirical metric quantifying self-refinement improvement over majority voting, reports its scaling with model size and weak correlation to base capability from experiments, and proposes GSR as joint training for candidate generation plus refinement with policy transfer from larger to smaller models. No derivation, equation, or load-bearing claim reduces by construction to a fitted input, self-definition, or self-citation chain; results are validated externally on five math benchmarks and out-of-distribution domains. The central claims rest on observed scaling trends and standard transfer learning rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical discovery that refinement ability scales with model size independently of base capability and on the assumption that this ability can be distilled via joint training. No explicit free parameters or invented physical entities are stated.

axioms (1)

domain assumption: Refinement Gap scales clearly with model size and is only weakly correlated with base capability.
Invoked to justify transferring the refinement policy from teacher to student models.

invented entities (1)

Refinement Gap (no independent evidence)
purpose: Quantify relative improvement of self-refinement over majority voting.
New metric introduced to analyze the effectiveness of parallel self-refinement.

pith-pipeline@v0.9.0 · 5785 in / 1313 out tokens · 41579 ms · 2026-05-18T21:12:54.014903+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"We introduce Generative Self-Refinement (GSR), a parallel test-time scaling framework that transfers the refinement policy from larger teacher models with higher refinement gap into smaller students."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"The Refinement Gap exhibits a clear scaling trend with model size and is only weakly correlated with the base capability."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
cs.LG · 2026-05 · unverdicted · novelty 6.0

LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[3] Ahmadian, A.; Cremer, C.; Gallé, M.; Fadaee, M.; Kreutzer, J.; Pietquin, O.; Üstün, A.; and Hooker, S. 2024. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In Proceedings of ACL 2024, 12248–12267. Association for Computational Linguistics

[4] AI-MO. 2024a. AIMO Validation AIME Dataset. https://huggingface.co/datasets/AI-MO/aimo-validation-aime. Accessed: 2025-03-29

[5] AI-MO. 2024b. AIMO Validation AMC Dataset. https://huggingface.co/datasets/AI-MO/aimo-validation-amc. Accessed: 2025-03-29

[6] Brown, B.; Juravsky, J.; Ehrlich, R.; Clark, R.; Le, Q. V.; Ré, C.; and Mirhoseini, A. 2024. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv:2407.21787

[7] Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, ...

[8] Chen, X.; Li, G.; Wang, Z.; Jin, B.; Qian, C.; Wang, Y.; Wang, H.; Zhang, Y.; Zhang, D.; Zhang, T.; Tong, H.; and Ji, H. 2025. RM-R1: Reward Modeling as Reasoning. arXiv:2505.02387

[9] Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168

[10] Guo, J.; Chi, Z.; Dong, L.; Dong, Q.; Wu, X.; Huang, S.; and Wei, F. 2025. Reward Reasoning Model. arXiv:2505.14674

[11] He, C.; Luo, R.; Bai, Y.; Hu, S.; Thai, Z. L.; Shen, J.; Hu, J.; Han, X.; Huang, Y.; Zhang, Y.; Liu, J.; Qi, L.; Liu, Z.; and Sun, M. 2024. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. In Proceedings of ACL 2024, 3828–3850. Association for Computational Linguistics

[12] Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In Proceedings of NeurIPS 2021

[13] Hochlehnert, A.; Bhatnagar, H.; Udandarao, V.; Albanie, S.; Prabhu, A.; and Bethge, M. 2025. A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. arXiv:2504.07086

[14] Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L. A.; Welbl, J.; Clark, A.; Hennigan, T.; Noland, E.; Millican, K.; van den Driessche, G.; Damoc, B.; Guy, A.; Osindero, S.; Simonyan, K.; Elsen, E.; Rae, J. W.; Vinyals, O.; and Sifre, L. 2022. Training Compute-Optimal Large Language Models. ar...

[15] Irvine, R.; Boubert, D.; Raina, V.; Liusie, A.; Zhu, Z.; Mudupalli, V.; Korshuk, A.; Liu, Z.; Cremer, F.; Assassi, V.; Beauchamp, C.-C.; Lu, X.; Rialan, T.; and Beauchamp, W. 2023. Rewarding Chatbots for Real-World Engagement with Millions of Users. arXiv:2303.06135

[16] Jiang, D.; Ren, X.; and Lin, B. Y. 2023. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. In Proceedings of ACL 2023, 14165–14178. Association for Computational Linguistics

[17] Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; and Amodei, D. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361

[18] Kim, G.; Baldi, P.; and McAleer, S. 2023. Language Models can Solve Computer Tasks. In Proceedings of NeurIPS 2023

[19] Kim, S.; Suk, J.; Longpre, S.; Lin, B. Y.; Shin, J.; Welleck, S.; Neubig, G.; Lee, M.; Lee, K.; and Seo, M. 2024. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of EMNLP 2024, 4334–4353. Association for Computational Linguistics

[20] Lin, Y.-T. 2025. AIME 2025 Dataset. https://huggingface.co/datasets/yentinglin/aime_2025. Accessed: 2025-03-29

[21] Liu, C. Y.; Zeng, L.; Liu, J.; Yan, R.; He, J.; Wang, C.; Yan, S.; Liu, Y.; and Zhou, Y. 2024. Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs. arXiv:2410.18451

[22] Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; Gupta, S.; Majumder, B. P.; Hermann, K.; Welleck, S.; Yazdanbakhsh, A.; and Clark, P. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Proceedings of NeurIPS 2023

[23] Mahan, D.; Phung, D. V.; Rafailov, R.; Blagden, C.; Lile, N.; Castricato, L.; Fränken, J.-P.; Finn, C.; and Albalak, A. 2024. Generative Reward Models. arXiv:2410.12832

[24] Moshkov, I.; Hanley, D.; Sorokin, I.; Toshniwal, S.; Henkel, C.; Schifferer, B.; Du, W.; and Gitman, I. 2025. AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset. arXiv:2504.16891

[25] Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; Jiang, X.; Cobbe, K.; Eloundou, T.; Krueger, G.; Button, K.; Knight, M.; Chess, B.; and Schulman, J. 2022. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332

[26] Paul, D.; Ismayilzada, M.; Peyrard, M.; Borges, B.; Bosselut, A.; West, R.; and Faltings, B. 2024. REFINER: Reasoning Feedback on Intermediate Representations. In Proceedings of EACL 2024, 1100–1126. Association for Computational Linguistics

[27] QwenTeam. 2024. QwQ: Reflect Deeply on the Boundaries of the Unknown

[28] QwenTeam. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning

[29] Rafailov, R.; Chittepu, Y.; Park, R.; Sikchi, H.; Hejna, J.; Knox, W. B.; Finn, C.; and Niekum, S. 2024. Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms. In Proceedings of NeurIPS 2024

[30] Snell, C.; Lee, J.; Xu, K.; and Kumar, A. 2024. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314

[31] Song, Y.; Wang, G.; Li, S.; and Lin, B. Y. 2025. The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism. In Proceedings of NAACL 2025, 4195–4206. Association for Computational Linguistics

[32] Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D. M.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; and Christiano, P. F. 2020. Learning to summarize with human feedback. In Proceedings of NeurIPS 2020

[33] Vernikos, G.; Brazinskas, A.; Adámek, J.; Mallinson, J.; Severyn, A.; and Malmi, E. 2024. Small Language Models Improve Giants by Rewriting Their Outputs. In Proceedings of EACL 2024, 2703–2718. Association for Computational Linguistics

[34] Wang, X.; Wei, J.; Schuurmans, D.; Le, Q. V.; Chi, E. H.; Narang, S.; Chowdhery, A.; and Zhou, D. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of ICLR 2023

[35] Whitehouse, C.; Wang, T.; Yu, P.; Li, X.; Weston, J.; Kulikov, I.; and Saha, S. 2025. J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. arXiv:2505.10320

[36] Winata, G. I.; Anugraha, D.; Susanto, L.; Kuwanto, G.; and Wijaya, D. T. 2025. MetaMetrics: Calibrating Metrics for Generation Tasks Using Human Preferences. In Proceedings of ICLR 2025

[37] Wu, Y.; Sun, Z.; Li, S.; Welleck, S.; and Yang, Y. 2025. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving. In Proceedings of ICLR 2025. OpenReview.net

[38] Xie, C.; Huang, Y.; Zhang, C.; Yu, D.; Chen, X.; Lin, B. Y.; Li, B.; Ghazi, B.; and Kumar, R. 2025. On Memorization of Large Language Models in Logical Reasoning. arXiv:2410.23123

[39] Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; Zheng, C.; Liu, D.; Zhou, F.; Huang, F.; Hu, F.; Ge, H.; Wei, H.; Lin, H.; Tang, J.; Yang, J.; Tu, J.; Zhang, J.; Yang, J.; Yang, J.; Zhou, J.; Zhou, J.; Lin, J.; Dang, K.; Bao, K.; Yang, K.; Yu, L.; Deng, L.; Li, M.; Xue, M.; Li, M.; Zhang, P.; Wang, P.; Zhu, Q...

[40] Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; Lin, H.; Yang, J.; Tu, J.; Zhang, J.; Yang, J.; Yang, J.; Zhou, J.; Lin, J.; Dang, K.; Lu, K.; Bao, K.; Yang, K.; Yu, L.; Li, M.; Xue, M.; Zhang, P.; Zhu, Q.; Men, R.; Lin, R.; Li, T.; Tang, T.; Xia, T.; Ren, X.; Ren, X.; Fan, Y.; Su, Y.; Zhang, Y.; Wan, Y.; Li...

[41] Yang, L.; Yu, Z.; Cui, B.; and Wang, M. 2025c. ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates. arXiv:2502.06772

[42] Zhang, B.; Zhang, X.; Zhang, J.; Yu, J.; Luo, S.; and Tang, J. 2025a. CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis. arXiv:2501.01668

[43] Zhang, Q.; Lyu, F.; Sun, Z.; Wang, L.; Zhang, W.; Hua, W.; Wu, H.; Guo, Z.; Wang, Y.; Muennighoff, N.; King, I.; Liu, X.; and Ma, C. 2025b. A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? arXiv:2503.24235

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3]

Ahmadian, A.; Cremer, C.; Gallé, M.; Fadaee, M.; Kreutzer, J.; Pietquin, O.; Üstün, A.; and Hooker, S. 2024. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In Proceedings of ACL 2024, 12248–12267. Association for Computational Linguistics

[4]

AI-MO. 2024a. AIMO Validation AIME Dataset. https://huggingface.co/datasets/AI-MO/aimo-validation-aime. Accessed: 2025-03-29

[5]

AI-MO. 2024b. AIMO Validation AMC Dataset. https://huggingface.co/datasets/AI-MO/aimo-validation-amc. Accessed: 2025-03-29

[6]

Brown, B.; Juravsky, J.; Ehrlich, R.; Clark, R.; Le, Q. V.; Ré, C.; and Mirhoseini, A. 2024. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. arXiv:2407.21787

[7]

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, ...


[8]

Chen, X.; Li, G.; Wang, Z.; Jin, B.; Qian, C.; Wang, Y.; Wang, H.; Zhang, Y.; Zhang, D.; Zhang, T.; Tong, H.; and Ji, H. 2025. RM-R1: Reward Modeling as Reasoning. arXiv:2505.02387

[9]

Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168

[10]

Guo, J.; Chi, Z.; Dong, L.; Dong, Q.; Wu, X.; Huang, S.; and Wei, F. 2025. Reward Reasoning Model. arXiv:2505.14674

[11]

He, C.; Luo, R.; Bai, Y.; Hu, S.; Thai, Z. L.; Shen, J.; Hu, J.; Han, X.; Huang, Y.; Zhang, Y.; Liu, J.; Qi, L.; Liu, Z.; and Sun, M. 2024. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. In Proceedings of ACL 2024, 3828–3850. Association for Computational Linguistics

[12]

Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In Proceedings of NeurIPS 2021

[13]

Hochlehnert, A.; Bhatnagar, H.; Udandarao, V.; Albanie, S.; Prabhu, A.; and Bethge, M. 2025. A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility. arXiv:2504.07086

[14]

Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L. A.; Welbl, J.; Clark, A.; Hennigan, T.; Noland, E.; Millican, K.; van den Driessche, G.; Damoc, B.; Guy, A.; Osindero, S.; Simonyan, K.; Elsen, E.; Rae, J. W.; Vinyals, O.; and Sifre, L. 2022. Training Compute-Optimal Large Language Models. ar...


[15]

Irvine, R.; Boubert, D.; Raina, V.; Liusie, A.; Zhu, Z.; Mudupalli, V.; Korshuk, A.; Liu, Z.; Cremer, F.; Assassi, V.; Beauchamp, C.-C.; Lu, X.; Rialan, T.; and Beauchamp, W. 2023. Rewarding Chatbots for Real-World Engagement with Millions of Users. arXiv:2303.06135

[16]

Jiang, D.; Ren, X.; and Lin, B. Y. 2023. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. In Proceedings of ACL 2023, 14165–14178. Association for Computational Linguistics

[17]

Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; and Amodei, D. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361

[18]

Kim, G.; Baldi, P.; and McAleer, S. 2023. Language Models can Solve Computer Tasks. In Proceedings of NeurIPS 2023

[19]

Kim, S.; Suk, J.; Longpre, S.; Lin, B. Y.; Shin, J.; Welleck, S.; Neubig, G.; Lee, M.; Lee, K.; and Seo, M. 2024. Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models. In Proceedings of EMNLP 2024, 4334–4353. Association for Computational Linguistics

[20]

Lin, Y.-T. 2025. AIME 2025 Dataset. https://huggingface.co/datasets/yentinglin/aime_2025. Accessed: 2025-03-29

[21]

Liu, C. Y.; Zeng, L.; Liu, J.; Yan, R.; He, J.; Wang, C.; Yan, S.; Liu, Y.; and Zhou, Y. 2024. Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs. arXiv:2410.18451

[22]

Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; Gupta, S.; Majumder, B. P.; Hermann, K.; Welleck, S.; Yazdanbakhsh, A.; and Clark, P. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Proceedings of NeurIPS 2023

[23]

Mahan, D.; Phung, D. V.; Rafailov, R.; Blagden, C.; Lile, N.; Castricato, L.; Fränken, J.-P.; Finn, C.; and Albalak, A. 2024. Generative Reward Models. arXiv:2410.12832

[24]

Moshkov, I.; Hanley, D.; Sorokin, I.; Toshniwal, S.; Henkel, C.; Schifferer, B.; Du, W.; and Gitman, I. 2025. AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset. arXiv:2504.16891

[25]

Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; Jiang, X.; Cobbe, K.; Eloundou, T.; Krueger, G.; Button, K.; Knight, M.; Chess, B.; and Schulman, J. 2022. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332

[26]

Paul, D.; Ismayilzada, M.; Peyrard, M.; Borges, B.; Bosselut, A.; West, R.; and Faltings, B. 2024. REFINER: Reasoning Feedback on Intermediate Representations. In Proceedings of EACL 2024, 1100–1126. Association for Computational Linguistics

[27]

QwenTeam. 2024. QwQ: Reflect Deeply on the Boundaries of the Unknown

[28]

QwenTeam. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning

[29]

Rafailov, R.; Chittepu, Y.; Park, R.; Sikchi, H.; Hejna, J.; Knox, W. B.; Finn, C.; and Niekum, S. 2024. Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms. In Proceedings of NeurIPS 2024

[30]

Snell, C.; Lee, J.; Xu, K.; and Kumar, A. 2024. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314

[31]

Song, Y.; Wang, G.; Li, S.; and Lin, B. Y. 2025. The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism. In Proceedings of NAACL 2025, 4195–4206. Association for Computational Linguistics

[32]

Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D. M.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; and Christiano, P. F. 2020. Learning to summarize with human feedback. In Proceedings of NeurIPS 2020

[33]

Vernikos, G.; Brazinskas, A.; Adámek, J.; Mallinson, J.; Severyn, A.; and Malmi, E. 2024. Small Language Models Improve Giants by Rewriting Their Outputs. In Proceedings of EACL 2024, 2703–2718. Association for Computational Linguistics

[34]

Wang, X.; Wei, J.; Schuurmans, D.; Le, Q. V.; Chi, E. H.; Narang, S.; Chowdhery, A.; and Zhou, D. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of ICLR 2023

[35]

Whitehouse, C.; Wang, T.; Yu, P.; Li, X.; Weston, J.; Kulikov, I.; and Saha, S. 2025. J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. arXiv:2505.10320

[36]

Winata, G. I.; Anugraha, D.; Susanto, L.; Kuwanto, G.; and Wijaya, D. T. 2025. MetaMetrics: Calibrating Metrics for Generation Tasks Using Human Preferences. In Proceedings of ICLR 2025

[37]

Wu, Y.; Sun, Z.; Li, S.; Welleck, S.; and Yang, Y. 2025. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving. In Proceedings of ICLR 2025. OpenReview.net

[38]

Xie, C.; Huang, Y.; Zhang, C.; Yu, D.; Chen, X.; Lin, B. Y.; Li, B.; Ghazi, B.; and Kumar, R. 2025. On Memorization of Large Language Models in Logical Reasoning. arXiv:2410.23123

[39]

Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; Zheng, C.; Liu, D.; Zhou, F.; Huang, F.; Hu, F.; Ge, H.; Wei, H.; Lin, H.; Tang, J.; Yang, J.; Tu, J.; Zhang, J.; Yang, J.; Yang, J.; Zhou, J.; Zhou, J.; Lin, J.; Dang, K.; Bao, K.; Yang, K.; Yu, L.; Deng, L.; Li, M.; Xue, M.; Li, M.; Zhang, P.; Wang, P.; Zhu, Q...


[40]

Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; Lin, H.; Yang, J.; Tu, J.; Zhang, J.; Yang, J.; Yang, J.; Zhou, J.; Lin, J.; Dang, K.; Lu, K.; Bao, K.; Yang, K.; Yu, L.; Li, M.; Xue, M.; Zhang, P.; Zhu, Q.; Men, R.; Lin, R.; Li, T.; Tang, T.; Xia, T.; Ren, X.; Ren, X.; Fan, Y.; Su, Y.; Zhang, Y.; Wan, Y.; Li...


[41]

Yang, L.; Yu, Z.; Cui, B.; and Wang, M. 2025c. ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates. arXiv:2502.06772

[42]

Zhang, B.; Zhang, X.; Zhang, J.; Yu, J.; Luo, S.; and Tang, J. 2025a. CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis. arXiv:2501.01668

[43]

Zhang, Q.; Lyu, F.; Sun, Z.; Wang, L.; Zhang, W.; Hua, W.; Wu, H.; Guo, Z.; Wang, Y.; Muennighoff, N.; King, I.; Liu, X.; and Ma, C. 2025b. A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? arXiv:2503.24235