
arxiv: 2505.19075 · v3 · pith:DDPEBBOKnew · submitted 2025-05-25 · 💻 cs.AI · cs.CL · cs.LG

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Pith reviewed 2026-05-22 01:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords universal reasoner · plug-and-play · frozen LLMs · logit addition · modular composition · weak-to-strong generalization · mathematical reasoning · machine translation

The pith

A separate reasoning module added to any frozen LLM via logit addition improves math reasoning and translation without retraining the base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reasoning capabilities can be extracted into a standalone module trained independently on verifiable rewards. This module supplies token-level guidance to frozen large language models simply by adding its output logits to the backbone model's outputs at inference time. A sympathetic reader would care because the additive design supports composing multiple modules for complex tasks, allows modules trained on small models to guide much larger ones in the same family, and extends to vision-language and medical domains while using fewer resources than full fine-tuning.

Core claim

Universal Reasoner (UniR) is a modular component that decouples reasoning from the backbone: a standalone module is trained on verifiable rewards so that trajectory-level signals become token-level adjustments. Once trained, the module combines with a frozen LLM by adding its logits to the backbone's logits, steering generation toward better reasoning paths. This additive structure enables joint application of multiple modules for complex reasoning, and the paper reports weak-to-strong generalization across model sizes and domains.

What carries the argument

The additive logit combination of the UniR reasoning module with the frozen LLM backbone, which supplies per-token guidance derived from standalone reward training.
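
A minimal sketch of the additive combination, assuming both models emit one logit per entry of the same vocabulary. The function names and the three-token toy logits are illustrative assumptions, not code or numbers from the paper or its repository.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(backbone_logits, reasoner_logits):
    """Add the reasoning module's logits to the frozen backbone's logits."""
    if len(backbone_logits) != len(reasoner_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [b + r for b, r in zip(backbone_logits, reasoner_logits)]


# Toy next-token logits over a three-token vocabulary (made-up values).
backbone = [2.0, 1.0, 0.0]   # backbone alone prefers token 0
reasoner = [-1.0, 1.5, 0.0]  # hypothetical guidance favouring token 1

combined = combine_logits(backbone, reasoner)
probs = softmax(combined)
best = max(range(len(probs)), key=probs.__getitem__)
print(combined, best)
```

In this toy case the guidance flips the preferred token from index 0 to index 1 while the backbone's weights stay untouched, which is the behaviour the additive design relies on.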

If this is right

  • Multiple UniR modules trained for different tasks can be applied together by summing their logits to support complex reasoning.
  • A UniR module trained on a smaller model can guide substantially larger models from the same family.
  • The approach generalizes beyond text to vision-language models and to medical reasoning tasks.
  • Performance on mathematical reasoning and machine translation exceeds results from existing fine-tuning methods.
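
The composition claim can be sketched as a weighted sum of two modules' logits on top of the backbone. The Figure 4 caption describes a coefficient α in [0, 1] that balances two modules; the exact weighting below (α for one module, 1 - α for the other) is an assumption made for illustration, as are the function name and the toy values.

```python
def compose(backbone_logits, module_a_logits, module_b_logits, alpha):
    """Backbone logits plus an alpha-weighted mix of two modules' logits.

    Assumed form: backbone + alpha * A + (1 - alpha) * B.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not (len(backbone_logits) == len(module_a_logits) == len(module_b_logits)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    return [
        b + alpha * a + (1.0 - alpha) * c
        for b, a, c in zip(backbone_logits, module_a_logits, module_b_logits)
    ]


# Made-up logits: one module pushes token 0, the other pushes token 1.
backbone = [0.0, 0.0, 0.0]
math_module = [2.0, 0.0, 0.0]
translation_module = [0.0, 2.0, 0.0]

only_math = compose(backbone, math_module, translation_module, alpha=1.0)
only_translation = compose(backbone, math_module, translation_module, alpha=0.0)
balanced = compose(backbone, math_module, translation_module, alpha=0.5)
```

Setting α to 1 or 0 recovers a single module; intermediate values trade one skill against the other. An unweighted sum, as the first bullet describes, would correspond to giving both modules weight 1.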

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The logit-addition pattern could be tested for other capabilities such as factual grounding or output style control.
  • A library of reusable UniR-style modules might allow users to mix specialized skills on demand without retraining base models.
  • The method may lower the cost of iterative improvement by letting developers update only the reasoning component over time.

Load-bearing premise

The output logits of the separately trained UniR module align sufficiently with those of the frozen LLM so that direct addition yields coherent improvements rather than interference or degradation.
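
One way to make this premise measurable is to ask how far the combined next-token distribution moves away from the backbone's. The sketch below computes KL(combined || backbone) for a single decoding step; it is a diagnostic suggested here, not something the paper reports, and the logit values are invented.

```python
import math


def log_softmax(logits):
    """Log-probabilities from logits, using the log-sum-exp trick."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]


def kl_shift(backbone_logits, reasoner_logits):
    """KL(combined || backbone) in nats for one decoding step."""
    combined = [b + r for b, r in zip(backbone_logits, reasoner_logits)]
    log_p = log_softmax(combined)
    log_q = log_softmax(backbone_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


backbone = [2.0, 1.0, 0.0]
gentle = [0.1, -0.1, 0.0]          # small-magnitude guidance
loud = [10.0 * x for x in gentle]  # same direction, ten times the scale

shift_none = kl_shift(backbone, [0.0, 0.0, 0.0])
shift_gentle = kl_shift(backbone, gentle)
shift_loud = kl_shift(backbone, loud)
```

A module whose logits are on a much larger scale than the backbone's produces a much larger shift for the same direction of guidance, which is the kind of mismatch that could turn guidance into interference.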

What would settle it

Evaluating the combined UniR plus frozen LLM system on a standard mathematical reasoning benchmark and observing no accuracy gain or a drop in coherence compared to the frozen LLM alone.

Figures

Figures reproduced from arXiv: 2505.19075 by Choonghan Kim, Hangeol Chang, Hyunmin Hwang, Jaemin Kim, Jong Chul Ye.

Figure 1. UniR Framework Overview. Our approach trains a lightweight, transferable reasoning module (πr) using predefined rewards to guide a frozen backbone model (πb), offering (1) transferability across different backbone models or tasks; and (2) composability by combining multiple specialized reasoning modules through reward optimization.
Figure 2. Effectiveness of Reasoning Policy Transfer. Results demonstrate that a trained reasoning module can improve performance when integrated with larger backbone models across diverse mathematical reasoning tasks.
Figure 3. Reasoning Performance of the πr Module. (Left) The backbone model πb and (Middle) the standalone reasoning module πr each produce incorrect, repetitive, and logically flawed reasoning. (Right) When combined, they generate coherent reasoning and arrive at the correct solution, showing the effectiveness of the modular guidance.
Figure 4. Performance on a German-to-English Math problem-solving task. The numbers in the figure indicate the value of α, where α ∈ [0, 1] is a coefficient that balances the influence between both modules. GPT-4.1-nano is used to evaluate translation quality and the accuracy of the generated output. Detailed configurations for this experiment are provided in Appendix A.
Figure 5. VRAM usage versus batch size under an 80GB constraint. Our method scales to batch size 128, while full fine-tuning and LoRA are limited, demonstrating memory efficiency for large batch …
Figure 7. For GSM8K, the prompt specifies a reasoning-then-answer format, where the model is …
Figure 8. In the Math-12K prompt format used with the LLaMA and Qwen models, answers follow …
Figure 9. Prompt used for Translation English-to-German and German-to-English.
Figure 10. System prompt used for English to German Math task: specifies the reasoning-then-answer …
Figure 11. Prompt used for scoring English-to-German Math task: Translation quality and Math …
Figure 12. Chain-of-thought comparison on a GSM8k example: while the Base and GRPO models …
Figure 13. Example of bidirectional translation (German (DE) and English (EN)). Our approach …
Figure 14. We illustrate outputs for a German math problem solved in English. (Top) Guidance …
Figure 15. The Backbone Model and Reasoning Module both demonstrate flawed reasoning pro…
Figure 16. Illustrative examples of responses from the baseline VLM and our UniR-extended …
Figure 17. Transferability of the 0.5B πr and 1.5B πr reasoning modules when combined with a 14B frozen backbone model. The 1.5B πr module demonstrated superior performance.
Adjacent table fragment (truncated in the source; backbone Qwen2.5-3B; column groups In-distribution and Out-of-distribution, then Avg.):
  Method            Trained Model   GSM8K   MATH-500   AIME24   Minerva   OlympiadBench   Avg.
  Baseline          -               75.5    46.8       6.7      23.5      25.5            35.6
  Baseline + 0.5B   -               72.1    41.8       6.7      16.2      22.1            31.8
  Baseline + 1.5B   -               76.0    50.2       6.7      23.1      24.7            …
Figure 18. Performance comparison on the GSM8K dataset between standalone reasoning modules …

Original abstract

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalizes across domains such as vision-language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Universal Reasoner (UniR), a modular, composable plug-and-play reasoning module for frozen LLMs. UniR is trained separately in a decoupled manner using verifiable rewards to map trajectories to token-level logit signals. At inference, the UniR logits are added directly to those of the frozen backbone LLM to provide specialized reasoning guidance. This additive structure is claimed to enable composition of multiple task-specific modules, weak-to-strong generalization across model sizes, and cross-domain transfer (e.g., to vision-language models and medical reasoning). Experiments are stated to show that UniR surpasses existing fine-tuning methods on mathematical reasoning and machine translation, with code open-sourced.

Significance. If the logit-addition mechanism proves stable and effective, the approach could enable efficient, parameter-free specialization of large frozen LLMs through reusable modules, reducing the need for per-backbone retraining. The emphasis on composability and weak-to-strong generalization, combined with open-sourced code, would represent a practical contribution to modular LLM enhancement if empirically substantiated.

major comments (2)
  1. [Abstract and method description] The central claim that simply adding UniR output logits to the frozen LLM logits delivers effective token-level guidance rests on an unexamined assumption of scale and semantic commensurability. The decoupled training with verifiable rewards does not address potential mismatches in logit magnitude, temperature, or calibration (especially when UniR is trained on smaller models and added to larger ones or when multiple modules are summed), and no normalization, learned mixing coefficient, or ablation comparing addition to alternatives such as concatenation or reranking is described. This is load-bearing for all performance and generalization claims.
  2. [Experimental claims] The abstract asserts that UniR surpasses existing fine-tuning methods on mathematical reasoning and machine translation, yet provides no quantitative results, baselines, dataset specifications, error bars, statistical significance tests, or ablations isolating the contribution of the logit-addition operation. Without these, the empirical support for the superiority and stability claims cannot be evaluated.
minor comments (1)
  1. [Abstract] The abstract mentions generalization to vision-language models and medical reasoning without any supporting details, results, or dataset references; these claims should be either substantiated or removed from the abstract.
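
To make the first major comment concrete, here is a sketch of one possible normalization: standardize the module's logits to zero mean and unit spread before adding them with an explicit weight. This is an illustrative alternative, not the paper's method; `standardize`, `add_normalized`, and the `weight` parameter are hypothetical names.

```python
import statistics


def standardize(logits):
    """Rescale logits to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(logits)
    spread = statistics.pstdev(logits)
    if spread == 0.0:
        # A flat vector carries no preference, so it contributes nothing.
        return [0.0 for _ in logits]
    return [(x - mean) / spread for x in logits]


def add_normalized(backbone_logits, reasoner_logits, weight=1.0):
    """Add scale-normalized guidance to the backbone logits."""
    guidance = standardize(reasoner_logits)
    return [b + weight * g for b, g in zip(backbone_logits, guidance)]


backbone = [2.0, 1.0, 0.0]
quiet_module = [1.0, 2.0, 3.0]
loud_module = [10.0, 20.0, 30.0]  # same ranking, ten times the magnitude

from_quiet = add_normalized(backbone, quiet_module)
from_loud = add_normalized(backbone, loud_module)
```

After standardization the two modules give the same combined logits, so the module's raw scale no longer decides how strongly it steers; the `weight` argument does. Whether this helps or hurts accuracy is exactly what an ablation would have to show.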

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions that will strengthen the presentation of the logit-addition mechanism and the empirical claims.

Point-by-point responses
  1. Referee: [Abstract and method description] The central claim that simply adding UniR output logits to the frozen LLM logits delivers effective token-level guidance rests on an unexamined assumption of scale and semantic commensurability. The decoupled training with verifiable rewards does not address potential mismatches in logit magnitude, temperature, or calibration (especially when UniR is trained on smaller models and added to larger ones or when multiple modules are summed), and no normalization, learned mixing coefficient, or ablation comparing addition to alternatives such as concatenation or reranking is described. This is load-bearing for all performance and generalization claims.

    Authors: We acknowledge that explicit treatment of logit-scale commensurability is important for the additive mechanism. UniR is trained with a shared token vocabulary and verifiable rewards that encourage alignment with the backbone distribution; however, we did not detail normalization or mixing in the original submission. In the revision we will add a dedicated paragraph in the method section describing temperature scaling and per-module magnitude normalization, introduce a learned mixing coefficient as an optional hyperparameter, and include ablations that directly compare logit addition against hidden-state concatenation and reranking baselines. These additions will substantiate stability when modules are composed or transferred across model sizes. revision: yes

  2. Referee: [Experimental claims] The abstract asserts that UniR surpasses existing fine-tuning methods on mathematical reasoning and machine translation, yet provides no quantitative results, baselines, dataset specifications, error bars, statistical significance tests, or ablations isolating the contribution of the logit-addition operation. Without these, the empirical support for the superiority and stability claims cannot be evaluated.

    Authors: The full manuscript already reports quantitative results, baselines, datasets, and ablations in the Experiments section, including tables that isolate the effect of logit addition. To improve accessibility we will insert the most salient performance deltas and dataset names into the abstract. We will also add explicit statistical significance tests and further ablations that hold all other factors fixed while varying only the addition operation. These changes address the referee’s concern without altering the core findings. revision: partial
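
The statistical tests promised in the second response could take many forms. As one hedged example, a paired bootstrap over per-problem correctness gives an interval for the accuracy difference between the frozen backbone and the combined system. The function name, the resample count, and the toy correctness vectors are assumptions for illustration and are not results from the paper.

```python
import random


def paired_bootstrap(backbone_correct, combined_correct, n_resamples=2000, seed=0):
    """Return (observed_gain, lower, upper) for the accuracy difference.

    Inputs are per-problem 0/1 correctness lists in the same problem order.
    The bounds form an approximate 95% percentile interval.
    """
    n = len(backbone_correct)
    if n == 0 or n != len(combined_correct):
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(backbone_correct, combined_correct)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Invented correctness vectors for eight problems.
backbone_correct = [1, 0, 0, 1, 0, 1, 0, 0]
combined_correct = [1, 1, 0, 1, 1, 1, 0, 1]
gain, low, high = paired_bootstrap(backbone_correct, combined_correct)
```

If the interval excludes zero, the gain is unlikely to be a sampling artifact on that benchmark; an interval that straddles zero would match the "no accuracy gain" outcome described under "What would settle it".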

Circularity Check

0 steps flagged

No significant circularity: empirical training and external evaluations ground the claims

Full rationale

The paper presents UniR as a decoupled module trained on verifiable rewards to translate trajectory signals into token-level logits, which are then added to a frozen LLM backbone at inference. This structure is validated through direct experiments on mathematical reasoning and machine translation tasks, with reported improvements over PEFT baselines, plus demonstrations of composition and weak-to-strong generalization. No equations or derivations are shown that reduce any claimed prediction or result to a quantity defined in terms of itself or to a fitted parameter renamed as output. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the provided description. The method relies on independent training objectives and external task metrics rather than self-referential constructions, making the derivation chain self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that token spaces can be aligned sufficiently for logit addition to be effective and that verifiable rewards can be decomposed into standalone token-level guidance without architectural dependencies on the backbone.

axioms (1)
  • domain assumption: The token spaces of the reasoning module and the LLM backbone are shared or aligned.
    Stated in the abstract as a requirement for combining the modules by adding output logits.
invented entities (1)
  • Universal Reasoner (UniR) module (no independent evidence)
    purpose: To act as a standalone reasoning component that provides token-level guidance to frozen LLMs via logit addition and supports modular composition.
    New module introduced by the paper; no independent evidence outside the described experiments is provided.
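
The shared-token-space axiom is cheap to check before any logits are added. A sketch, assuming each vocabulary is available as a plain token-to-id dictionary (the helper name and the toy vocabularies are hypothetical):

```python
def vocab_mismatches(backbone_vocab, reasoner_vocab):
    """Return tokens that are missing from one vocabulary or mapped to different ids."""
    all_tokens = set(backbone_vocab) | set(reasoner_vocab)
    return sorted(
        tok for tok in all_tokens
        if backbone_vocab.get(tok) != reasoner_vocab.get(tok)
    )


# Toy vocabularies: "<pad>" has a different id, "=" exists in only one of them.
backbone_vocab = {"<pad>": 0, "2": 1, "+": 2, "=": 3}
aligned_vocab = {"<pad>": 0, "2": 1, "+": 2, "=": 3}
shifted_vocab = {"<pad>": 3, "2": 1, "+": 2}

ok = vocab_mismatches(backbone_vocab, aligned_vocab)
bad = vocab_mismatches(backbone_vocab, shifted_vocab)
```

An empty result means index i refers to the same token in both logit vectors, so element-wise addition is meaningful. Any mismatch means the sum would mix scores for different tokens, and the vocabularies would need to be aligned first.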

pith-pipeline@v0.9.0 · 5799 in / 1433 out tokens · 62831 ms · 2026-05-22T01:12:04.889510+00:00 · methodology



Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 13 internal anchors

  1. [1]

    https://open-thoughts.ai, 2025

    Open thoughts. https://open-thoughts.ai, 2025. Accessed: 2025-05-05

  2. [2]

    A distributional view on multi-objective policy optimization

    Abbas Abdolmaleki, Sandy Huang, Leonard Hasenclever, Michael Neunert, Francis Song, Martina Zambelli, Murilo Martins, Nicolas Heess, Raia Hadsell, and Martin Riedmiller. A distributional view on multi-objective policy optimization. In International conference on machine learning, pages 11–22. PMLR, 2020

  3. [3]

    Training language models to reason efficiently

    Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  5. [5]

    Bespoke-stratos: The unreasonable effectiveness of reasoning distillation

    Bespoke Labs. Bespoke-stratos: The unreasonable effectiveness of reasoning distillation. https://www.bespokelabs.ai/blog/bespoke-stratos-the-unreasonable-effectiveness-of-reasoning-distillation

  6. [6]

    Overview of the iwslt 2017 evaluation campaign

    Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. Overview of the iwslt 2017 evaluation campaign. In Proceedings of the 14th International Workshop on Spoken Language Translation, pages 2–14, 2017

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  8. [8]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  9. [9]

    Controlled text generation via language model arithmetic

    Jasper Dekoninck, Marc Fischer, Luca Beurer-Kellner, and Martin Vechev. Controlled text generation via language model arithmetic. arXiv preprint arXiv:2311.14479, 2023

  10. [10]

    Agent AI: Surveying the Horizons of Multimodal Interaction

    Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. Agent ai: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024

  11. [11]

    Mt-r1-zero: Advancing llm-based machine translation via r1-zero-like reinforcement learning

    Zhaopeng Feng, Shaosheng Cao, Jiahan Ren, Jiayuan Su, Ruizhe Chen, Yan Zhang, Zhe Xu, Yao Hu, Jian Wu, and Zuozhu Liu. Mt-r1-zero: Advancing llm-based machine translation via r1-zero-like reinforcement learning. arXiv preprint arXiv:2504.10160, 2025

  12. [12]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  13. [13]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  14. [14]

    xcomet: Transparent machine translation evaluation through fine-grained error detection

    Nuno M Guerreiro, Ricardo Rei, Daan van Stigt, Luisa Coheur, Pierre Colombo, and André FT Martins. xcomet: Transparent machine translation evaluation through fine-grained error detection. Transactions of the Association for Computational Linguistics, 12:979–995, 2024

  15. [15]

    Reinforcement learning with deep energy-based policies

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pages 1352–1361. PMLR, 2017

  16. [16]

    Value augmented sampling for language model alignment and personalization

    Seungwook Han, Idan Shenfeld, Akash Srivastava, Yoon Kim, and Pulkit Agrawal. Value augmented sampling for language model alignment and personalization. arXiv preprint arXiv:2405.06639, 2024

  17. [17]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024

  18. [18]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  19. [19]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

  20. [20]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  21. [21]

    Rain: Your language models can align themselves without finetuning

    Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. Rain: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124, 2023

  22. [22]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

  23. [23]

    Making ppo even better: Value-guided monte-carlo tree search decoding

    Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. Making ppo even better: Value-guided monte-carlo tree search decoding. Openreview https://openreview.net/forum?id=QaODpeRaOK, 2023

  24. [24]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  25. [25]

    Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021

  26. [26]

    Controlled decoding from language models

    Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, et al. Controlled decoding from language models. arXiv preprint arXiv:2310.17022, 2023

  27. [27]

    s1: Simple test-time scaling, 2025

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025

  28. [28]

    Learning to reason with llms

    OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, September 2024. Accessed: 2025-05-05

  29. [29]

    Bolt: Bootstrap long chain-of-thought in language models without distillation

    Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, and Caiming Xiong. Bolt: Bootstrap long chain-of-thought in language models without distillation. arXiv preprint arXiv:2502.03860, 2025

  30. [30]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  31. [31]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  32. [32]

    Cometkiwi: Ist-unbabel 2022 submission for the quality estimation shared task

    Ricardo Rei, Marcos Treviso, Nuno M Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José GC De Souza, Taisiya Glushkova, Duarte M Alves, Alon Lavie, et al. Cometkiwi: Ist-unbabel 2022 submission for the quality estimation shared task. arXiv preprint arXiv:2209.06243, 2022

  33. [33]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  35. [35]

    Language models are multilingual chain-of-thought reasoners, 2022

    Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022

  36. [36]

    Offline rl for natural language generation with implicit language q learning

    Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, and Sergey Levine. Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871, 2022.

  37. [37]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  38. [38]

    Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review

    Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, and Tommaso Biancalani. Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review. arXiv preprint arXiv:2501.09685, 2025

  39. [39]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  40. [40]

    Genarm: Reward guided generation with autoregressive reward model for test-time alignment

    Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, and Sumitra Ganesh. Genarm: Reward guided generation with autoregressive reward model for test-time alignment. arXiv preprint arXiv:2410.08193, 2024

  41. [41]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  42. [42]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  43. [43]

    Preference-grounded token-level guidance for language model fine-tuning

    Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, and Mingyuan Zhou. Preference-grounded token-level guidance for language model fine-tuning. Advances in Neural Information Processing Systems, 36:24466–24496, 2023

  44. [44]

    Limo: Less is more for reasoning, 2025

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025

  45. [45]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  46. [46]

    LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

    Longteng Zhang, Lin Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303, 2023

  47. [47]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024

  48. [48]

    Maximum entropy inverse reinforcement learning

    Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.