arXiv: 2509.00084 · v2 · submitted 2025-08-27 · cs.LG · cs.AI · cs.CL

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

Pith reviewed 2026-05-18 21:12 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL
keywords: self-refinement · parallel reasoning · test-time scaling · LLM reasoning · mathematical benchmarks · model transfer · generative refinement · majority voting

The pith

A single model can learn to both generate parallel reasoning candidates and refine them into a better answer by transferring that skill from larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that the benefit of self-refinement over simple majority voting across multiple reasoning attempts grows steadily with model size, yet depends only weakly on the model's base accuracy. The authors define the Refinement Gap as the extra gain from refinement beyond majority voting and argue that this property supports a transfer process in which larger models teach smaller ones how to improve on a set of candidates. They introduce Generative Self-Refinement, which trains one model jointly on candidate generation and on producing a refined answer conditioned on those candidates. According to the reported experiments, this approach surpasses other parallel aggregation techniques on five mathematical benchmarks, and the learned refinement ability carries over to different model scales and families and to an out-of-distribution domain. A sympathetic reader would care because selection-based test-time methods such as Best-of-N and majority voting cannot return a correct answer when every candidate is wrong; teaching the model to repair the collective output offers a way to keep improving without requiring any single path to be correct on its own.

Core claim

The central claim is that the Refinement Gap, which measures how much self-refinement improves upon majority voting, follows a clear scaling trend with model size while correlating only weakly with base capability. This separation allows the refinement policy to be transferred from larger teacher models into smaller student models through joint training of a single model that both produces strong parallel candidates and refines a superior final answer from them.
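To make the metric concrete, here is a minimal sketch of how a Refinement Gap could be computed from per-problem outputs. The text above describes the gap only as the improvement of self-refinement beyond majority voting and gives no formula, so the normalization used here (gain divided by majority-voting accuracy) is an assumption, as are all function names.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent candidate answer (ties go to the earliest seen)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return Counter(candidates).most_common(1)[0][0]


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def refinement_gap(candidate_sets, refined_answers, references):
    """ASSUMED definition: (acc_refine - acc_vote) / acc_vote.

    candidate_sets:  one list of candidate answers per problem
    refined_answers: one refined answer per problem
    references:      one ground-truth answer per problem
    """
    voted = [majority_vote(c) for c in candidate_sets]
    acc_vote = accuracy(voted, references)
    acc_refine = accuracy(refined_answers, references)
    if acc_vote == 0:
        raise ValueError("relative gap is undefined when majority voting scores 0")
    return (acc_refine - acc_vote) / acc_vote
```

A positive value means refinement recovered answers that voting could not, which includes problems where no candidate was correct. An absolute difference (`acc_refine - acc_vote`) would be an equally plausible reading of the description.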

What carries the argument

Generative Self-Refinement (GSR) is a parallel test-time scaling method that jointly trains one model to generate candidates and to synthesize a refined answer conditioned on them. The refinement behavior is distilled from larger teacher models, which exhibit a higher Refinement Gap.
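The inference-time flow described here can be sketched as two passes through the same model: sample several candidates, then condition a final generation on all of them. This is an illustrative sketch only. The `generate` callable, the prompt wording, and the default of four candidates are hypothetical placeholders, not the paper's actual template or settings.

```python
from typing import Callable, List


def build_refinement_prompt(problem: str, candidates: List[str]) -> str:
    """Pack the problem and all candidates into one prompt (hypothetical template)."""
    parts = [f"Problem:\n{problem}", ""]
    for i, cand in enumerate(candidates, start=1):
        parts.append(f"Candidate {i}:\n{cand}")
        parts.append("")
    parts.append("Review the candidates, which may all be wrong, and give a final answer.")
    return "\n".join(parts)


def generative_self_refine(
    problem: str,
    generate: Callable[[str], str],  # assumed interface: prompt in, completion out
    num_candidates: int = 4,
) -> str:
    """Sample candidates, then let the same model refine them into one answer."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    # The loop stands in for parallel sampling; a real system would batch these calls.
    candidates = [generate(problem) for _ in range(num_candidates)]
    return generate(build_refinement_prompt(problem, candidates))
```

Unlike voting, the final call is free to produce an answer that appears in none of the candidates.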

If this is right

  • The method reaches state-of-the-art results across five mathematical benchmarks compared with other parallel aggregation approaches.
  • The learned refinement skill transfers across multiple model scales and families.
  • The approach shows robust generalization when tested on an out-of-distribution domain.
  • Joint training lets the model improve both the quality of its candidate solutions and its ability to refine them.
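One way to picture the joint training mentioned in the last bullet is a supervised fine-tuning set that mixes two example types: direct solving and refinement over candidates. The sketch below assumes teacher outputs are already available as records. The record fields, prompt layout, and equal mixing of the two tasks are assumptions for illustration and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TeacherRecord:
    problem: str
    solution: str          # teacher's direct solution
    candidates: List[str]  # candidate solutions shown to the teacher
    refined: str           # teacher's refined answer given those candidates


def build_hybrid_sft_data(records: List[TeacherRecord], seed: int = 0) -> List[Dict[str, str]]:
    """Emit one 'solve' and one 'refine' example per record, then shuffle."""
    examples = []
    for rec in records:
        examples.append({"task": "solve", "prompt": rec.problem, "target": rec.solution})
        numbered = "\n".join(
            f"Candidate {i}: {c}" for i, c in enumerate(rec.candidates, start=1)
        )
        examples.append(
            {"task": "refine", "prompt": f"{rec.problem}\n{numbered}", "target": rec.refined}
        )
    # Seeded shuffle so both tasks are interleaved reproducibly.
    random.Random(seed).shuffle(examples)
    return examples
```

Training on both example types with a single model is what would let candidate quality and refinement ability improve together, which is also why the referee report below asks for an ablation that separates the two effects.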

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smaller models could close part of the performance gap with much larger models on reasoning tasks by acquiring this refinement capability without adding parameters.
  • The Refinement Gap could serve as a diagnostic to decide in advance which models are likely to benefit from self-refinement at test time.
  • The same joint-training pattern might be applied to reasoning domains beyond mathematics, such as code generation or scientific problem solving.

Load-bearing premise

The Refinement Gap must increase reliably with model size while staying only weakly tied to the model's basic accuracy, so that the refinement policy can be successfully transferred to smaller models through joint training.

What would settle it

Run GSR on a small student model and observe whether its final accuracy on the five mathematical benchmarks fails to exceed majority voting or drops below the performance of the unrefined student.
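The test above reduces to a per-benchmark comparison of three accuracies. A minimal sketch follows, with hypothetical dictionary keys; it ignores run-to-run variance, which a real check would need to account for before declaring a failure.

```python
from typing import Dict, List


def failing_benchmarks(results: Dict[str, Dict[str, float]]) -> List[str]:
    """List benchmarks where GSR does not beat voting or falls below the unrefined student.

    results maps a benchmark name to accuracies under the (hypothetical) keys
    "unrefined", "majority", and "gsr", each in [0, 1].
    """
    failing = []
    for name, acc in results.items():
        for key in ("unrefined", "majority", "gsr"):
            if not 0.0 <= acc[key] <= 1.0:
                raise ValueError(f"{name}/{key} must be an accuracy in [0, 1]")
        fails_to_exceed_vote = acc["gsr"] <= acc["majority"]
        below_unrefined = acc["gsr"] < acc["unrefined"]
        if fails_to_exceed_vote or below_unrefined:
            failing.append(name)
    return failing
```

An empty list is consistent with the claim; any entry marks a benchmark on which the claim fails under this criterion.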

Figures

Figures reproduced from arXiv: 2509.00084 by Dongmei Zhang, Fangkai Yang, Furu Wei, Lu Wang, Pu Zhao, Qibin Wang, Qingwei Lin, Saravan Rajmohan, Shaohan Huang.

Figure 1: An example of our method. The model is provided …
Figure 2: An overview of our method. We introduce Generative Self-Refinement (GSR), a framework that generates a superior final answer by selectively leveraging insights from multiple parallel candidate solutions generated by itself. As illustrated in …
Figure 3: An overview of the hybrid training pipeline, which consists of a data generation stage followed by a supervised fine …
Figure 4: Prompt template for our method. Full prompt tem …
Figure 5: Accuracy of input scaling on the performance …
Figure 6: Experiments results for our model and its base on a …
Figure 7: Average token counts of model outputs across five benchmarks, showing the breakdown between the Thinking (solid) …
Original abstract

Test-time scaling (TTS) has gained widespread attention for enhancing LLM reasoning. Existing approaches such as Best-of-N and majority voting are limited as their performance depends on the quality of candidate responses, making them unable to produce a correct solution when all candidates are incorrect. Parallel self-refinement, generating multiple candidates and synthesizing a refined answer conditioned on them, offers a promising alternative, but the underlying mechanism driving its effectiveness remains obscure. To bridge this gap in understanding, we introduce a new metric, the Refinement Gap, designed to quantify the relative improvement of self-refinement beyond majority voting. We show that the Refinement Gap exhibits a clear scaling trend with model size and is only weakly correlated with the base capability. Based on this discovery, we propose Generative Self-Refinement (GSR), a parallel test-time scaling framework that transfers the refinement policy from larger teacher models with higher refinement gap into smaller students. Crucially, GSR jointly trains a single model to generate strong candidates and refine a better final answer based on these candidates. Experimental results demonstrate that our method achieves state-of-the-art performance across five mathematical benchmarks over other parallel aggregation methods, while the learned refinement skill transfers across multiple model scales and families and exhibits robust generalization to an out-of-distribution domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Refinement Gap metric to quantify the improvement of parallel self-refinement over majority voting in LLM reasoning. It reports that this gap scales with model size while correlating only weakly with base capability. Building on this, the authors propose Generative Self-Refinement (GSR), a framework that jointly trains a model to generate candidate solutions and refine a final answer from them, transferring the refinement policy from larger teacher models to smaller student models. The central empirical claims are state-of-the-art results on five mathematical benchmarks relative to other parallel aggregation methods, successful cross-scale and cross-family transfer of the refinement skill, and robust generalization to an out-of-distribution domain.

Significance. If the scaling observation and transfer results hold after proper controls, the work offers a concrete mechanism for improving test-time scaling on smaller models by distilling refinement behavior from larger ones, moving beyond simple aggregation methods like Best-of-N or voting. The introduction of the Refinement Gap as a diagnostic tool and the joint-training formulation are potentially useful contributions to understanding and engineering parallel reasoning in LLMs.

major comments (2)
  1. [Method and Experiments] The justification for GSR rests on the claim that the Refinement Gap enables cross-scale policy transfer because it scales with size yet is only weakly correlated with base capability. However, because GSR jointly optimizes candidate generation and refinement within a single model, the reported gains on the five benchmarks could arise from improved candidate quality under the joint objective rather than from distilling a size-dependent refinement skill. An ablation that isolates the refinement head (e.g., freezing candidate generation or comparing against pure distillation of teacher refinements) is needed to establish that the observed transfer is attributable to the gap rather than joint training effects.
  2. [Abstract and Experimental Results] The abstract and experimental claims assert SOTA performance and robust transfer/generalization, yet the provided text supplies no details on the exact baselines, number of candidates, statistical tests, variance across runs, or ablation tables that would allow evaluation of whether the gains are load-bearing for the transfer hypothesis.
minor comments (2)
  1. [Introduction] Notation for the Refinement Gap should be defined with an explicit formula early in the paper rather than introduced descriptively.
  2. [Figures and Tables] Figure captions and table headers would benefit from explicit statements of the number of samples, seeds, and confidence intervals used.
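The reporting requested in the minor comments (seeds and confidence intervals) could be produced along these lines. This is a generic sketch using only the standard library. The percentile bootstrap and the default of 1000 resamples are choices made here for illustration, not procedures described in the paper.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def bootstrap_ci(
    correct_flags: List[int],
    num_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy from per-problem 0/1 flags."""
    if not correct_flags:
        raise ValueError("need at least one problem")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    rng = random.Random(seed)
    n = len(correct_flags)
    # Resample problems with replacement and record the accuracy of each resample.
    means = sorted(sum(rng.choices(correct_flags, k=n)) / n for _ in range(num_resamples))
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lower, upper
```

On small benchmarks the interval can be wide, which is one reason the number of problems and seeds matters for reading the tables.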

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the contributions of the Refinement Gap and Generative Self-Refinement (GSR). We respond to each major comment below and outline revisions to strengthen the empirical support for our claims about refinement skill transfer.

Point-by-point responses
  1. Referee: [Method and Experiments] The justification for GSR rests on the claim that the Refinement Gap enables cross-scale policy transfer because it scales with size yet is only weakly correlated with base capability. However, because GSR jointly optimizes candidate generation and refinement within a single model, the reported gains on the five benchmarks could arise from improved candidate quality under the joint objective rather than from distilling a size-dependent refinement skill. An ablation that isolates the refinement head (e.g., freezing candidate generation or comparing against pure distillation of teacher refinements) is needed to establish that the observed transfer is attributable to the gap rather than joint training effects.

    Authors: We agree this is a valid concern and that joint optimization could improve candidate quality as a side effect. Our transfer results across scales and model families, combined with the Refinement Gap's scaling behavior and weak correlation to base accuracy, provide initial evidence that the refinement policy is the transferable component. However, to more rigorously isolate this, we will add two new ablations in the revised version: (1) a controlled experiment freezing the candidate-generation parameters while fine-tuning only the refinement component on teacher-generated candidates, and (2) a direct comparison against pure distillation of teacher refinement outputs without the joint objective. These will be reported alongside the existing cross-scale and cross-family transfer tables.
    revision: yes

  2. Referee: [Abstract and Experimental Results] The abstract and experimental claims assert SOTA performance and robust transfer/generalization, yet the provided text supplies no details on the exact baselines, number of candidates, statistical tests, variance across runs, or ablation tables that would allow evaluation of whether the gains are load-bearing for the transfer hypothesis.

    Authors: We appreciate the feedback on presentation. The full manuscript (Section 4 and Appendix) specifies the baselines (majority voting, Best-of-N, and other parallel aggregation methods), uses 8–16 candidates per problem, reports means and standard deviations over multiple runs, and includes ablation tables on transfer. Statistical comparisons are provided where differences are discussed. To address the concern directly, we will revise the abstract to include concise references to these experimental settings and add a short paragraph in the main text summarizing variance and controls. This will make the load-bearing nature of the transfer results clearer without altering the core claims.
    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical scaling observation and standard transfer learning

The paper introduces the Refinement Gap as an empirical metric quantifying self-refinement improvement over majority voting, reports its scaling with model size and weak correlation to base capability from experiments, and proposes GSR as joint training for candidate generation plus refinement with policy transfer from larger to smaller models. No derivation, equation, or load-bearing claim reduces by construction to a fitted input, self-definition, or self-citation chain; results are validated externally on five math benchmarks and out-of-distribution domains. The central claims rest on observed scaling trends and standard transfer learning rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical discovery that refinement ability scales with model size independently of base capability and on the assumption that this ability can be distilled via joint training. No explicit free parameters or invented physical entities are stated.

axioms (1)
  • domain assumption: The Refinement Gap scales clearly with model size and is only weakly correlated with base capability.
    Invoked to justify transferring the refinement policy from teacher to student models.
invented entities (1)
  • Refinement Gap (no independent evidence)
    purpose: Quantify the relative improvement of self-refinement over majority voting.
    New metric introduced to analyze the effectiveness of parallel self-refinement.

pith-pipeline@v0.9.0 · 5785 in / 1313 out tokens · 41579 ms · 2026-05-18T21:12:54.014903+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
