Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
Pith reviewed 2026-05-22 01:12 UTC · model grok-4.3
The pith
A separate reasoning module added to any frozen LLM via logit addition improves math reasoning and translation without retraining the base model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Universal Reasoner (UniR) is a standalone reasoning module, trained separately from the backbone, that decomposes the reward so that trajectory-level signals are translated into token-level adjustments. Once trained, the module is combined with a frozen LLM by adding its logits to the backbone's logits, steering generation toward better reasoning paths. This additive structure allows several modules to be applied jointly for complex reasoning, and the paper reports weak-to-strong generalization across model sizes and domains.
What carries the argument
The additive logit combination of the UniR reasoning module with the frozen LLM backbone, which supplies per-token guidance derived from standalone reward training.
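The additive combination can be illustrated with a minimal sketch. It assumes both models score the same vocabulary at each step; the function names (`log_softmax`, `combine_logits`) and the toy numbers are hypothetical and are not taken from the paper's code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def combine_logits(backbone_logits, reasoner_logits):
    """Add reasoner logits to frozen-backbone logits, then renormalize."""
    if len(backbone_logits) != len(reasoner_logits):
        raise ValueError("both models must score the same vocabulary")
    summed = [b + r for b, r in zip(backbone_logits, reasoner_logits)]
    return log_softmax(summed)

# Toy 4-token vocabulary; the numbers are illustrative only.
backbone = [2.0, 1.0, 0.5, -1.0]   # frozen LLM prefers token 0
reasoner = [-0.5, 1.5, 0.0, 0.0]   # reasoning module pushes toward token 1

combined = combine_logits(backbone, reasoner)
probs = [math.exp(v) for v in combined]
best_before = max(range(4), key=lambda i: backbone[i])
best_after = max(range(4), key=lambda i: combined[i])
```

In this toy case the backbone alone would pick token 0, while the summed logits pick token 1. The backbone's parameters are never touched; only the per-step scores change.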
If this is right
- Multiple UniR modules trained for different tasks can be applied together by summing their logits to support complex reasoning.
- A UniR module trained on a smaller model can guide substantially larger models from the same family.
- The approach generalizes beyond text to vision-language models and to medical reasoning tasks.
- Performance on mathematical reasoning and machine translation exceeds results from existing fine-tuning methods.
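The first consequence above, composing several modules by summing logits, might look like the following sketch. The per-module weights are an assumption added for illustration (the page mentions a weighted sum in Eq. 6 but gives no details), and `compose`, `math_module`, and `translation_module` are hypothetical names.

```python
import math

def compose(backbone_logits, module_logits, weights=None):
    """Return next-token probabilities from backbone + weighted module logits."""
    if weights is None:
        weights = [1.0] * len(module_logits)   # plain sum by default
    if len(weights) != len(module_logits):
        raise ValueError("one weight per module is required")
    total = list(backbone_logits)
    for w, logits in zip(weights, module_logits):
        if len(logits) != len(total):
            raise ValueError("all modules must share the backbone vocabulary")
        total = [t + w * v for t, v in zip(total, logits)]
    m = max(total)                              # subtract max for stability
    exps = [math.exp(t - m) for t in total]
    z = sum(exps)
    return [e / z for e in exps]

backbone = [1.0, 1.0, 1.0]
math_module = [2.0, 0.0, 0.0]         # hypothetical math-reasoning module
translation_module = [0.0, 2.0, 0.0]  # hypothetical translation module

equal = compose(backbone, [math_module, translation_module])
math_heavy = compose(backbone, [math_module, translation_module],
                     weights=[1.0, 0.25])
```

With equal weights the two modules' preferred tokens tie; down-weighting the translation module tilts the distribution toward the math module's token. How such weights should be chosen in practice is not specified on this page.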
Where Pith is reading between the lines
- The logit-addition pattern could be tested for other capabilities such as factual grounding or output style control.
- A library of reusable UniR-style modules might allow users to mix specialized skills on demand without retraining base models.
- The method may lower the cost of iterative improvement by letting developers update only the reasoning component over time.
Load-bearing premise
The output logits of the separately trained UniR module align sufficiently with those of the frozen LLM so that direct addition yields coherent improvements rather than interference or degradation.
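The premise can be made concrete by writing out what logit addition implies. The first line is Eq. 5 as quoted further down this page; the second follows by exponentiating and requiring the result to sum to one over the vocabulary $\mathcal{V}$. The third line, with a scale factor $\alpha$, is an illustrative assumption and not something the page attributes to the paper.

```latex
\begin{align}
\log \pi_\theta(y_t \mid x, y_{<t})
  &= \log \pi_b(y_t \mid x, y_{<t}) + \log \pi_r(y_t \mid x, y_{<t}) - \log Z' \\
\pi_\theta(y_t \mid x, y_{<t})
  &= \frac{\pi_b(y_t \mid x, y_{<t})\,\pi_r(y_t \mid x, y_{<t})}{Z'},
  \qquad
  Z' = \sum_{v \in \mathcal{V}} \pi_b(v \mid x, y_{<t})\,\pi_r(v \mid x, y_{<t}) \\
\pi_\theta^{(\alpha)}(y_t \mid x, y_{<t})
  &\propto \pi_b(y_t \mid x, y_{<t})\,\pi_r(y_t \mid x, y_{<t})^{\alpha}
\end{align}
```

Read this way, the combined policy is a renormalized product of the two distributions. If the module's logits are effectively scaled by some $\alpha$ relative to the backbone's, small $\alpha$ leaves the backbone nearly unchanged and large $\alpha$ lets the module dominate, which is why scale alignment matters for the premise.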
What would settle it
Evaluating the combined UniR plus frozen LLM system on a standard mathematical reasoning benchmark and observing no accuracy gain or a drop in coherence compared to the frozen LLM alone.
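One way to decide whether an observed gain is distinguishable from noise is a paired bootstrap over per-problem correctness. This is a generic statistical sketch, not a procedure described in the paper; `paired_bootstrap` and the outcome lists are made up for illustration.

```python
import random

def paired_bootstrap(base_correct, combined_correct, n_resamples=2000, seed=0):
    """Estimate the accuracy gain and how often resampling shows no gain.

    Both inputs are lists of 0/1 flags for the SAME problems, in the same order.
    """
    if len(base_correct) != len(combined_correct) or not base_correct:
        raise ValueError("need two non-empty, equally long result lists")
    n = len(base_correct)
    diffs = [c - b for b, c in zip(base_correct, combined_correct)]
    observed_gain = sum(diffs) / n
    rng = random.Random(seed)           # fixed seed for reproducibility
    no_gain = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:            # resample shows no improvement
            no_gain += 1
    return observed_gain, no_gain / n_resamples

# Made-up per-problem outcomes, for illustration only.
base = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
combined = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
gain, frac_no_gain = paired_bootstrap(base, combined)
```

A small `frac_no_gain` suggests the gain is unlikely to be a sampling artifact; a value near 0.5 or above would be consistent with the "no accuracy gain" outcome described above.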
Original abstract
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.
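The abstract's phrase "translating trajectory-level signals into token-level guidance" can be illustrated with a group-relative advantage computation in the style of GRPO, which this page identifies elsewhere as the training objective (Eq. 8). The sketch below shows only the commonly used normalization step and the broadcast of one scalar per trajectory to its tokens. It is an assumption about the general technique, not a reproduction of the paper's objective, and the function names are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize trajectory rewards within one group of sampled answers."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)     # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantages, token_counts):
    """Give every token of a trajectory that trajectory's advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]

# Four sampled answers to one prompt; reward 1.0 means a verifier accepted it.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 2, 4, 1]              # illustrative sequence lengths

adv = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(adv, token_counts)
```

Accepted answers receive a positive advantage on every token and rejected ones a negative advantage, so a single verifiable reward per trajectory ends up shaping per-token updates of the reasoning module.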
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Universal Reasoner (UniR), a modular, composable plug-and-play reasoning module for frozen LLMs. UniR is trained separately in a decoupled manner using verifiable rewards to map trajectories to token-level logit signals. At inference, the UniR logits are added directly to those of the frozen backbone LLM to provide specialized reasoning guidance. This additive structure is claimed to enable composition of multiple task-specific modules, weak-to-strong generalization across model sizes, and cross-domain transfer (e.g., to vision-language models and medical reasoning). Experiments are stated to show that UniR surpasses existing fine-tuning methods on mathematical reasoning and machine translation, with code open-sourced.
Significance. If the logit-addition mechanism proves stable and effective, the approach could enable efficient specialization of large frozen LLMs through reusable modules without updating any backbone parameters, reducing the need for per-backbone retraining. The emphasis on composability and weak-to-strong generalization, combined with open-sourced code, would represent a practical contribution to modular LLM enhancement if empirically substantiated.
major comments (2)
- [Abstract and method description] The central claim that simply adding UniR output logits to the frozen LLM logits delivers effective token-level guidance rests on an unexamined assumption of scale and semantic commensurability. The decoupled training with verifiable rewards does not address potential mismatches in logit magnitude, temperature, or calibration (especially when UniR is trained on smaller models and added to larger ones or when multiple modules are summed), and no normalization, learned mixing coefficient, or ablation comparing addition to alternatives such as concatenation or reranking is described. This is load-bearing for all performance and generalization claims.
- [Experimental claims] The abstract asserts that UniR surpasses existing fine-tuning methods on mathematical reasoning and machine translation, yet provides no quantitative results, baselines, dataset specifications, error bars, statistical significance tests, or ablations isolating the contribution of the logit-addition operation. Without these, the empirical support for the superiority and stability claims cannot be evaluated.
minor comments (1)
- [Abstract] The abstract mentions generalization to vision-language models and medical reasoning without any supporting details, results, or dataset references; these claims should be either substantiated or removed from the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions that will strengthen the presentation of the logit-addition mechanism and the empirical claims.
Point-by-point responses
- Referee: [Abstract and method description] The central claim that simply adding UniR output logits to the frozen LLM logits delivers effective token-level guidance rests on an unexamined assumption of scale and semantic commensurability. The decoupled training with verifiable rewards does not address potential mismatches in logit magnitude, temperature, or calibration (especially when UniR is trained on smaller models and added to larger ones or when multiple modules are summed), and no normalization, learned mixing coefficient, or ablation comparing addition to alternatives such as concatenation or reranking is described. This is load-bearing for all performance and generalization claims.
  Authors: We acknowledge that explicit treatment of logit-scale commensurability is important for the additive mechanism. UniR is trained with a shared token vocabulary and verifiable rewards that encourage alignment with the backbone distribution; however, we did not detail normalization or mixing in the original submission. In the revision we will add a dedicated paragraph in the method section describing temperature scaling and per-module magnitude normalization, introduce a learned mixing coefficient as an optional hyperparameter, and include ablations that directly compare logit addition against hidden-state concatenation and reranking baselines. These additions will substantiate stability when modules are composed or transferred across model sizes.
  Revision: yes
- Referee: [Experimental claims] The abstract asserts that UniR surpasses existing fine-tuning methods on mathematical reasoning and machine translation, yet provides no quantitative results, baselines, dataset specifications, error bars, statistical significance tests, or ablations isolating the contribution of the logit-addition operation. Without these, the empirical support for the superiority and stability claims cannot be evaluated.
  Authors: The full manuscript already reports quantitative results, baselines, datasets, and ablations in the Experiments section, including tables that isolate the effect of logit addition. To improve accessibility we will insert the most salient performance deltas and dataset names into the abstract. We will also add explicit statistical significance tests and further ablations that hold all other factors fixed while varying only the addition operation. These changes address the referee’s concern without altering the core findings.
  Revision: partial
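The normalization and mixing ideas raised in the first exchange could take a form like the sketch below. It standardizes the module's logits, scales them by a fixed coefficient `alpha`, and applies a sampling temperature. All of this is an illustrative assumption: the rebuttal speaks of a learned coefficient and does not specify a normalization scheme, and `standardize` and `guided_probs` are hypothetical names.

```python
import math

def standardize(logits, eps=1e-6):
    """Center logits and scale them to roughly unit standard deviation."""
    mean = sum(logits) / len(logits)
    var = sum((v - mean) ** 2 for v in logits) / len(logits)
    return [(v - mean) / (math.sqrt(var) + eps) for v in logits]

def guided_probs(backbone_logits, module_logits, alpha=1.0, temperature=1.0):
    """Backbone logits plus alpha times standardized module logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = standardize(module_logits)
    total = [(b + alpha * s) / temperature
             for b, s in zip(backbone_logits, scaled)]
    m = max(total)                       # subtract max for stability
    exps = [math.exp(t - m) for t in total]
    z = sum(exps)
    return [e / z for e in exps]

backbone = [2.0, 1.0, 0.0]
small_scale = [0.0, 0.1, 0.0]    # same preference, tiny magnitude
large_scale = [0.0, 10.0, 0.0]   # same preference, large magnitude

p_small = guided_probs(backbone, small_scale)
p_large = guided_probs(backbone, large_scale)
p_off = guided_probs(backbone, large_scale, alpha=0.0)
```

After standardization the two modules, which differ only in magnitude, steer the backbone almost identically, and `alpha=0.0` recovers the backbone's own distribution. Standardization also discards any information carried by the module's confidence, which is one reason an ablation against plain addition would be informative.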
Circularity Check
No significant circularity: empirical training and external evaluations ground the claims
Full rationale
The paper presents UniR as a decoupled module trained on verifiable rewards to translate trajectory signals into token-level logits, which are then added to a frozen LLM backbone at inference. This structure is validated through direct experiments on mathematical reasoning and machine translation tasks, with reported improvements over PEFT baselines, plus demonstrations of composition and weak-to-strong generalization. No equations or derivations are shown that reduce any claimed prediction or result to a quantity defined in terms of itself or to a fitted parameter renamed as output. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the provided description. The method relies on independent training objectives and external task metrics rather than self-referential constructions, making the derivation chain self-contained against benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The token spaces of the reasoning module and the LLM backbone are shared or aligned.
invented entities (1)
- Universal Reasoner (UniR) module: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: $\log \pi_\theta(y_t \mid x, y_{<t}) = \log \pi_b(y_t \mid x, y_{<t}) + \log \pi_r(y_t \mid x, y_{<t}) - \log Z'$ (Eq. 5); weighted sum of logits for multiple modules (Eq. 6)
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery from Law of Logic (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: GRPO objective on predefined rewards with frozen backbone (Eq. 8)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.