Recognition: 2 theorem links
· Lean Theorem · TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning
Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3
The pith
A planner learns to invoke slow reasoning only for hard user histories in generative recommendation, raising accuracy while lowering latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that combining a fast SID retriever, a candidate ranker, and a slow reasoning model that converts collaborative item-to-item knowledge into natural-language rationales, together with a planner trained via supervised warm-up and agentic RL to decide which tool to call for each user sequence, yields both higher accuracy and lower latency than any single fixed strategy applied uniformly.
What carries the argument
The planner. Trained through supervised warm-up followed by agentic reinforcement learning, it dynamically selects among the fast SID-based retriever, the lightweight ranker, and the slow reasoning model with injected commonsense explanations.
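The planner's job reduces to a three-way dispatch per user sequence. The sketch below caricatures it as a threshold rule over a hypothetical difficulty score, purely for illustration: in the paper the planner is an LLM policy trained with supervised warm-up and agentic RL, and the function and tool names here are assumptions, not the authors' API.

```python
def plan(difficulty: float, fast_threshold: float = 0.3,
         slow_threshold: float = 0.7) -> str:
    """Route a user sequence to one of three tools by estimated difficulty.

    Thresholds are illustrative; the paper learns this decision instead.
    """
    if difficulty < fast_threshold:
        return "fast_sid_retriever"   # easy history: direct SID generation
    if difficulty < slow_threshold:
        return "candidate_ranker"     # medium: rerank a retrieved shortlist
    return "slow_reasoner"            # hard: rationale-first recommendation
```

The point of the learned version is precisely that no fixed threshold exists: the planner must infer difficulty from the raw interaction history itself.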
Load-bearing premise
That the planner can reliably detect which user sequences benefit from slow reasoning, and that the injected item-to-item commonsense explanations remain useful across different datasets.
What would settle it
An evaluation on the same three datasets in which the adaptive planner produces no accuracy gain over the best fixed-strategy baseline or fails to reduce average latency would falsify the claim.
Original abstract
Generative recommendation with Semantic IDs (SIDs) has emerged as a promising paradigm, yet existing methods apply a fixed inference strategy, either fast direct generation or slow chain-of-thought reasoning, uniformly across all user histories. This approach creates a trade-off: fast recommendation model produces suboptimal accuracy on hard samples, while always invoking slow reasoning incurs prohibitive latency and wastes computation on easy cases. To address this, we propose Think Fast, Think Slow, Then Act, a framework that learns to adaptively allocate reasoning effort per user sequence. Our system equips an LLM with three complementary tools: a fast SID-based retriever, a lightweight candidate ranker, and a slow reasoning model that generates explicit rationales before recommending. Crucially, we inject collaborative commonsense into the slow model by transforming item-to-item knowledge into natural language explanations. A planner, trained through supervised warm-up followed by agentic reinforcement learning, dynamically decides which tool to invoke. Experiments on three datasets demonstrate that our method outperforms strong baselines, achieving consistent accuracy gains while reducing inference latency compared to uniform slow reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TwiSTAR, a generative recommendation system using Semantic IDs (SIDs) that equips an LLM with three tools—a fast SID retriever, a lightweight ranker, and a slow reasoning model augmented with natural-language item-to-item commonsense explanations. A planner, trained first by supervised warm-up and then by agentic reinforcement learning, adaptively selects which tool to invoke per user sequence. The central empirical claim is that this adaptive allocation yields consistent accuracy gains over strong baselines while reducing inference latency relative to always invoking the slow reasoning path, demonstrated on three datasets.
Significance. If the planner reliably routes hard sequences to slow reasoning and the commonsense explanations measurably improve the slow path, the framework offers a practical way to resolve the accuracy–latency trade-off in LLM-based generative recommenders. The combination of tool use, commonsense injection, and agentic RL for routing is a coherent extension of recent work on adaptive inference. However, the absence of direct validation for the planner’s decisions and the contribution of the injected explanations prevents a clear assessment of whether the reported gains are attributable to the adaptive mechanism itself.
major comments (3)
- [Abstract / Experiments] The claims of 'consistent accuracy gains' and 'reducing inference latency' are presented without numerical results, error bars, statistical tests, per-dataset breakdowns, or a description of how the planner was evaluated. This omission makes the central empirical claim unverifiable from the provided text.
- [Experiments] No ablation is reported that disables the planner (e.g., uniform slow reasoning, random routing, or always-fast) or removes the commonsense injection. Without these controls it is impossible to determine whether accuracy improvements arise from adaptive allocation or simply from the presence of multiple tools.
- [Method] The Method / Planner subsection provides no planner decision statistics (percentage of sequences routed to slow reasoning, agreement with an oracle that knows when slow reasoning helps, or routing accuracy on held-out data). This leaves the weakest assumption, that the planner reliably identifies sequences requiring slow reasoning, unsupported by direct evidence.
minor comments (2)
- [Title] The title contains a missing space after the colon ('TwiSTAR:Think Fast').
- [Abstract] The abstract refers to 'three complementary tools' but does not explicitly list the lightweight candidate ranker in the final sentence; a brief clarification would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on TwiSTAR. The comments highlight important areas for strengthening the empirical validation of our adaptive reasoning framework. We address each major comment below and will incorporate revisions to improve clarity and evidence.
Point-by-point responses
-
Referee: [Abstract / Experiments] The claims of 'consistent accuracy gains' and 'reducing inference latency' are presented without numerical results, error bars, statistical tests, per-dataset breakdowns, or a description of how the planner was evaluated. This omission makes the central empirical claim unverifiable from the provided text.
Authors: We agree that the abstract presents high-level claims. The Experiments section reports per-dataset accuracy and latency results across three datasets, but lacks explicit error bars, statistical tests, and a detailed planner evaluation description in the main text. In revision, we will update the abstract with key numerical highlights (e.g., relative gains), add error bars and significance tests to experimental tables, include per-dataset breakdowns with planner routing details, and expand the planner evaluation description to make all claims directly verifiable. revision: yes
-
Referee: [Experiments] No ablation is reported that disables the planner (e.g., uniform slow reasoning, random routing, or always-fast) or removes the commonsense injection. Without these controls it is impossible to determine whether accuracy improvements arise from adaptive allocation or simply from the presence of multiple tools.
Authors: This is a valid concern. Our experiments compare against fixed fast and slow baselines, but do not include explicit ablations for random routing or commonsense removal. We will add these controls in the revised Experiments section: always-fast, uniform slow reasoning, random tool selection, and slow reasoning without commonsense explanations. These will isolate the planner's adaptive contribution and the value of injected explanations. revision: yes
-
Referee: [Method] The Method / Planner subsection provides no planner decision statistics (percentage of sequences routed to slow reasoning, agreement with an oracle that knows when slow reasoning helps, or routing accuracy on held-out data). This leaves the weakest assumption, that the planner reliably identifies sequences requiring slow reasoning, unsupported by direct evidence.
Authors: We acknowledge that direct planner diagnostics are missing. The manuscript emphasizes end-to-end results but omits routing statistics. In revision, we will add a dedicated analysis with the percentage of sequences routed to slow reasoning, correlation with sequence difficulty, and routing accuracy or oracle agreement metrics on held-out data where available. This will provide direct evidence supporting the planner's decisions. revision: yes
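The diagnostics both sides agree to add amount to a few lines of bookkeeping. Below is a minimal sketch over made-up routing decisions; `routing_stats`, the tool labels, and the data are illustrative assumptions, not results from the paper.

```python
def routing_stats(decisions, oracle):
    """Fraction of sequences routed to the slow tool, and agreement with an
    oracle that knows when slow reasoning actually helps."""
    n = len(decisions)
    slow_rate = sum(d == "slow" for d in decisions) / n
    agreement = sum(d == o for d, o in zip(decisions, oracle)) / n
    return slow_rate, agreement

# Toy illustration only; these decisions are invented, not paper data.
decisions = ["fast", "slow", "fast", "slow", "fast"]
oracle = ["fast", "slow", "slow", "slow", "fast"]
slow_rate, agreement = routing_stats(decisions, oracle)
```

Reporting `slow_rate` alongside latency would also make the latency-reduction claim mechanically checkable, since the expected cost is a mixture over the tools invoked.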
Circularity Check
No significant circularity; empirical framework with no derivations or self-referential reductions
full rationale
The paper describes an engineering framework for adaptive tool use in generative recommendation (fast SID retriever, ranker, slow reasoning model with injected commonsense) and a planner trained via supervised warm-up plus agentic RL. No equations, first-principles derivations, or closed-form predictions appear in the provided text. Central claims rest on empirical comparisons across three datasets rather than any quantity that reduces to its own fitted inputs by construction. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The planner's routing behavior and commonsense injection are presented as design choices whose value is asserted via experiment, not via definitional equivalence. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can usefully generate explicit rationales for item recommendations when given collaborative patterns in natural language
- domain assumption: A planner trained with supervised warm-up plus agentic RL can learn to allocate fast versus slow tools without excessive overhead
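The second assumption is typically operationalized in RL for routing as a reward that trades accuracy against tool cost. The shape below is a hedged sketch only; the latencies, the weight, and the function name are assumptions for illustration, not the paper's actual GRPO reward design.

```python
# Assumed per-tool latencies in seconds; illustrative, not from the paper.
TOOL_LATENCY = {"fast": 0.05, "rank": 0.2, "slow": 2.0}

def routing_reward(hit: bool, tool: str, latency_weight: float = 0.1) -> float:
    """Reward a correct recommendation, minus a latency charge for the tool.

    Under this shape, invoking the slow tool only pays off when it flips
    a miss into a hit; easy cases are pushed toward the fast tool.
    """
    return (1.0 if hit else 0.0) - latency_weight * TOOL_LATENCY[tool]
```

Whether the learned planner actually internalizes such a trade-off is exactly what the requested routing diagnostics would test.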
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
A planner, trained through supervised warm-up followed by agentic reinforcement learning, dynamically decides which tool to invoke... Experiments on three datasets demonstrate that our method outperforms strong baselines, achieving consistent accuracy gains while reducing inference latency compared to uniform slow reasoning.
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
we inject collaborative commonsense into the slow model by transforming item-to-item knowledge into natural language explanations
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
LLaRA: Large language-recommendation assistant
Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. LLaRA: Large language-recommendation assistant. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, page 1785–1795, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 97...
-
[2]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903
-
[3]
OneRec-Think: In-text reasoning for generative recommendation, 2025
Zhanyu Liu, Shiyao Wang, Xingmei Wang, Rongzhou Zhang, Jiaxin Deng, Honghui Bao, Jinghao Zhang, Wuchao Li, Pengfei Zheng, Xiangyu Wu, Yifei Hu, Qigen Hu, Xinchen Luo, Lejian Ren, Zixing Zhang, Qianqian Wang, Kuo Cai, Yunfan Wu, Hongtao Cheng, Zexuan Cheng, Lu Ren, Huanjie Wang, Yi Su, Ruiming Tang, Kun Gai, and Guorui Zhou. OneRec-Think: In-text reasonin...
-
[4]
Generative reasoning recommendation via LLMs, 2025
Minjie Hong, Zetong Zhou, Zirun Guo, Ziang Zhang, Ruofan Hu, Weinan Gan, Jieming Zhu, and Zhou Zhao. Generative reasoning recommendation via LLMs, 2025. URL https://arxiv.org/abs/2510.20815
-
[5]
OxygenREC: An instruction-following generative framework for e-commerce recommendation
Xuegang Hao, Ming Zhang, Alex Li, Xiangyu Qian, Zhi Ma, Yanlong Zang, Shijie Yang, Zhongxuan Han, Xiaolong Ma, Jinguang Liu, Zhen Li, Zhida Jiang, Shusheng Wang, Ning Tang, Yanchen Qiao, Chenxiang Yang, Chen Sun, Jincheng Yuan, Chunhua Peng, Heng Hu, Peijun Yang, Baopeng Yuan, Caiyun Qiu, Zhaolong Xing, Haofei Yuan, Haipeng Zhang, Yuzhang Guo, Weijie Ding...
-
[6]
An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
-
[7]
Siavash Ameli, Siyuan Zhuang, Ion Stoica, and Michael W
Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, Vibhav Vineet, Yue Wu, Safoora...
-
[8]
Claude 3.7 Sonnet and Claude Code
Anthropic. Claude 3.7 Sonnet and Claude Code. https://www.anthropic.com/news/claude-3-7-sonnet, 2025. Accessed: 2026-04-29
-
[9]
Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. M6-Rec: Generative pretrained language models are open-ended recommender systems, 2022. URL https://arxiv.org/abs/2205.08084
-
[10]
Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. TALLRec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys '23, page 1007–1014, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702419...
-
[11]
LLMTreeRec: Unleashing the power of large language models for cold-start recommendations, 2024
Wenlin Zhang, Chuhan Wu, Xiangyang Li, Yuhao Wang, Kuicai Dong, Yichao Wang, Xinyi Dai, Xiangyu Zhao, Huifeng Guo, and Ruiming Tang. LLMTreeRec: Unleashing the power of large language models for cold-start recommendations, 2024. URL https://arxiv.org/abs/2404.00702
-
[12]
A bi-step grounding paradigm for large language models in recommendation systems
Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yanchen Luo, Chong Chen, Fuli Feng, and Qi Tian. A bi-step grounding paradigm for large language models in recommendation systems. ACM Trans. Recomm. Syst., 3(4), April 2025. doi: 10.1145/3716393. URL https://doi.org/10.1145/3716393
-
[13]
On softmax direct preference optimization for recommendation
Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, and Tat-Seng Chua. On softmax direct preference optimization for recommendation. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY , USA, 2024. Curran Associates Inc. ISBN 9798331314385
-
[14]
InteraRec: Interactive recommendations using multimodal large language models
Saketh Reddy Karra and Theja Tulabandhula. InteraRec: Interactive recommendations using multimodal large language models. In Trends and Applications in Knowledge Discovery and Data Mining: PAKDD 2024 Workshops, RAFDA and IWTA, Taipei, Taiwan, May 7–10, 2024, Proceedings, page 32–43, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-981-97-2649-3. do...
-
[15]
Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5),
- [16]
-
[17]
Zhixuan Chu, Hongyan Hao, Xin Ouyang, Simeng Wang, Yan Wang, Yue Shen, Jinjie Gu, Qing Cui, Longfei Li, Siqiao Xue, James Y Zhang, and Sheng Li. Leveraging large language models for pre-trained recommender systems, 2023. URL https://arxiv.org/abs/2308.10837
-
[18]
Adapting large language models by integrating collaborative semantics for recommendation
Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, Ming Chen, and Ji-Rong Wen. Adapting large language models by integrating collaborative semantics for recommendation. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 1435–1448, 2024. doi: 10.1109/ICDE60146.2024.00118
-
[19]
PLUM: Adapting pre-trained language models for industrial-scale generative recommendations
Ruining He, Lukasz Heldt, Lichan Hong, Raghunandan Keshavan, Shifan Mao, Nikhil Mehta, Zhengyang Su, Alicia Tsai, Yueqi Wang, Shao-Chuan Wang, Xinyang Yi, Lexi Baugher, Baykal Cakici, Ed Chi, Cristos Goodrow, Ningren Han, He Ma, Romer Rosales, Abby Van Soest, Devansh Tandon, Su-Lin Wu, Weilong Yang, and Yilin Zheng. PLUM: Adapting pre-trained language mod...
-
[20]
Fine-grained semantics integration for large language model-based recommendation, 2026
Jiawei Feng, Xiaoyu Kong, Leheng Sheng, Bin Wu, Chao Yi, Feifang Yang, Xiang-Rong Sheng, Han Zhu, Xiang Wang, Jiancan Wu, and Xiangnan He. Fine-grained semantics integration for large language model-based recommendation, 2026. URL https://arxiv.org/abs/2602.22632
-
[21]
Reasoning over semantic IDs enhances generative recommendation, 2026
Yingzhi He, Yan Sun, Junfei Tan, Yuxin Chen, Xiaoyu Kong, Chunxu Shen, Xiang Wang, An Zhang, and Tat-Seng Chua. Reasoning over semantic IDs enhances generative recommendation, 2026. URL https://arxiv.org/abs/2603.23183
-
[22]
Leadre: Multi-faceted knowledge enhanced LLM empowered display advertisement recommender system, 2025
Fengxin Li, Yi Li, Yue Liu, Chao Zhou, Yuan Wang, Xiaoxiang Deng, Wei Xue, Dapeng Liu, Lei Xiao, Haijie Gu, Jie Jiang, Hongyan Liu, Biao Qin, and Jun He. Leadre: Multi-faceted knowledge enhanced llm empowered display advertisement recommender system, 2025. URL https://arxiv.org/abs/2411.13789
-
[23]
Thomas, Alexandra Ranieri, Matthew N
Edoardo D’Amico, Marco De Nadai, Praveen Chandar, Divita Vohra, Shawn Lin, Max Lefarov, Paul Gigioli, Gustavo Penha, Ilya Kopysitsky, Ivo Joel Senese, Darren Mei, Francesco Fabbri, Oguz Semerci, Yu Zhao, Vincent Tang, Brian St. Thomas, Alexandra Ranieri, Matthew N. K. Smith, Aaron Bernkopf, Bryan Leung, Ghazal Fazelnia, Mark VanMiddlesworth, Timothy ...
-
[24]
Gould, Yves Raimond, Sandeep Ghael, Tony Jebara, Andreas Damianou, Vladan Radosavljevic, Paul N
Marco De Nadai, Edoardo D’Amico, Max Lefarov, Alexandre Tamborrino, Divita Vohra, Mark VanMiddlesworth, Shawn Lin, Jacqueline Wood, Jan Stypka, Eliza Klyce, Keshi Dai, Timothy Christopher Heath, Martin D. Gould, Yves Raimond, Sandeep Ghael, Tony Jebara, Andreas Damianou, Vladan Radosavljevic, Paul N. Bennett, Mounia Lalmas, and Praveen Chandar. A unif...
-
[25]
Generative reasoning re-ranker, 2026
Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Wenlin Chen, Santanu Kolay, Sandeep Pandey, Hamed Firooz, and Luke Simon. Generative reasoning re-ranker, 2026. URL https://arxiv.org/abs/2...
- [26]
-
[27]
RecGPT-V2 technical report. arXiv preprint arXiv:2512.14503, 2025
Chao Yi, Dian Chen, Gaoyang Guo, Jiakai Tang, Jian Wu, Jing Yu, Mao Zhang, Wen Chen, Wenjun Yang, Yujie Luo, Yuning Jiang, Zhujin Gao, Bo Zheng, Binbin Cao, Changfa Wu, Dixuan Wang, Han Wu, Haoyi Hu, Kewei Zhu, Lang Tian, Lin Yang, Qiqi Huang, Siqi Yang, Wenbo Su, Xiaoxiao He, Xin Tong, Xu Chen, Xunke Xi, Xiaowei Huang, Yaxuan Wu, Yeqiu Yang, Yi Hu, Yujin...
-
[28]
Recbot: Agent-based recommendation system. arXiv preprint arXiv:2509.21317, 2025
Jiakai Tang, Yujie Luo, Xunke Xi, Fei Sun, Xueyang Feng, Sunhao Dai, Chao Yi, Dian Chen, Zhujin Gao, Yang Li, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, and Bo Zheng. Interactive recommendation agent with active user commands. arXiv preprint arXiv:2509.21317, 2025
-
[29]
Seungheon Doh, Keunwoo Choi, and Juhan Nam. TalkPlay-Tools: Conversational music recommendation with LLM tool calling. arXiv preprint arXiv:2510.01698, 2025
-
[30]
Deep interest network for click-through rate prediction
Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '18, pages 1059–1068, New York, NY, USA, 2018. Association for Computing Machi...
-
[31]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
-
[32]
Image-based recommendations on styles and substitutes, 2015
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes, 2015. URL https://arxiv.org/abs/1506.04757
-
[33]
Hierarchical gating networks for sequential recommendation, 2019
Chen Ma, Peng Kang, and Xue Liu. Hierarchical gating networks for sequential recommendation, 2019. URL https://arxiv.org/abs/1906.09217
-
[34]
Session-based Recommendations with Recurrent Neural Networks
Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks, 2016. URL https://arxiv.org/abs/1511.06939
-
[35]
Self-attentive sequential recommendation, 2018
Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation, 2018. URL https://arxiv.org/abs/1808.09781
-
[36]
Recommender systems with generative retrieval, 2023
Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Maheswaran Sathiamoorthy. Recommender systems with generative retrieval, 2023. URL https://arxiv.org/abs/2305.05065
-
[37]
Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, Yinghai Lu, and Yu Shi. Actions speak louder than words: Trillion-parameter sequential transducers for generative recommendations, 2024. URL https://arxiv.org/abs/2402.17152
-
[38]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805
-
[39]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300