Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
Pith reviewed 2026-05-13 02:14 UTC · model grok-4.3
The pith
A language model can learn to judge which training prompts will help it most by predicting its own reward variance from recent examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
METIS internalizes curriculum judgment as a native capability of the policy. It uses within-prompt reward variance as a gauge of prompt informativeness, predicts this variance from recent training outcomes treated as in-context examples, and jointly optimizes the standard RFT objective together with an additional self-judgment reward so the policy learns what to learn next.
What carries the argument
METIS self-judgment mechanism: the policy predicts within-prompt reward variance from recent training history as in-context examples and receives a joint reward for both task success and accurate judgment.
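The variance gauge described above can be sketched concretely. The following is a minimal, hypothetical illustration, not the paper's implementation: the function name, the group size of eight rollouts, and the binary rewards are all assumptions.

```python
import statistics

def prompt_informativeness(rewards_per_prompt):
    """Score each prompt by the variance of its rollout rewards.

    Hypothetical sketch of the variance gauge: with binary task rewards,
    variance is zero when a prompt is always solved or never solved, and
    peaks when outcomes are mixed.
    """
    return {
        prompt: statistics.pvariance(rewards)
        for prompt, rewards in rewards_per_prompt.items()
    }

# Eight rollouts per prompt; 1 = solved, 0 = failed (illustrative data).
rollouts = {
    "easy":   [1, 1, 1, 1, 1, 1, 1, 1],  # saturated -> uninformative
    "hard":   [0, 0, 0, 0, 0, 0, 0, 0],  # hopeless  -> uninformative
    "useful": [1, 0, 1, 1, 0, 0, 1, 0],  # mixed     -> most informative
}
scores = prompt_informativeness(rollouts)
best = max(scores, key=scores.get)  # "useful", variance 0.25
```

On this toy data both saturated prompts score zero, so a variance-driven allocator would spend its training budget on the mixed-outcome prompt.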
If this is right
- The policy reaches higher performance on mathematical reasoning, code generation, and agentic function-calling benchmarks.
- Training converges up to 67 percent faster than methods that rely on handcrafted heuristics or auxiliary models.
- Curriculum decisions become aligned directly with the policy's evolving training dynamics.
- The model acquires an internal ability to decide what to learn next without external guidance.
Where Pith is reading between the lines
- The same variance-based self-assessment could be tested in other reward-driven training settings where example ordering matters.
- Jointly training judgment and task performance may reduce the engineering effort needed to maintain separate curriculum modules.
- If the approach scales, future models might handle their own data selection with less human-specified structure.
Load-bearing premise
Reward variance inside a prompt reliably indicates how informative that prompt is for training, and the policy can accurately predict this variance from its own recent outcomes while learning to judge itself at the same time.
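Under this premise, the joint objective can be sketched as below. This is a hypothetical form: the abstract does not state how the two rewards are combined, so the absolute-error judgment term and the weight `lam` are illustrative choices.

```python
def joint_reward(task_reward, predicted_var, observed_var, lam=0.5):
    """Combine the standard RFT reward with a self-judgment reward.

    Hypothetical form: the judgment term is the negative absolute error
    of the policy's own variance prediction; `lam` and the error metric
    are illustrative assumptions, not the paper's stated objective.
    """
    judgment_reward = -abs(predicted_var - observed_var)
    return task_reward + lam * judgment_reward

r_good = joint_reward(1.0, predicted_var=0.25, observed_var=0.25)  # 1.0
r_bad = joint_reward(1.0, predicted_var=0.0, observed_var=0.25)    # 0.875
```

An accurate variance prediction leaves the task reward untouched, while a poor prediction is penalized, which is what lets the policy "learn what to learn next."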
What would settle it
Running the same benchmarks with METIS and with standard external curriculum methods, then finding no difference in final performance or convergence speed, would show the internalized judgment adds nothing.
Original abstract
In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy's training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks from mathematical reasoning, code generation, to agentic function-calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces METIS (METacognitive Internalized Self-judgment), a framework for LLM reinforcement fine-tuning (RFT) that internalizes curriculum judgment. It uses within-prompt reward variance as a proxy for prompt informativeness, predicts this variance via in-context learning from recent training outcomes, and jointly optimizes the policy with both standard RFT rewards and a self-judgment reward to enable dynamic training allocation without external heuristics or auxiliary models. The work claims consistent superiority and up to 67% faster convergence across discrete and continuous benchmarks in mathematical reasoning, code generation, and agentic function-calling.
Significance. If the core claims hold after validation, the approach would be significant for RFT by providing a simple closed-loop alternative to handcrafted curricula, potentially improving alignment and efficiency. The metacognitive internalization idea is conceptually novel and could reduce reliance on external components, but the absence of supporting analysis for the variance proxy and experimental details currently limits its assessed impact.
major comments (3)
- [Abstract and §3] The foundational claim that within-prompt reward variance effectively gauges prompt informativeness is asserted without preliminary correlation analysis, comparison to alternatives (e.g., reward magnitude or gradient norm), or validation against actual learning progress, leaving the curriculum decision mechanism ungrounded.
- [§4, joint optimization] The closed-loop design jointly optimizes standard RFT rewards with the self-judgment reward, but no independent external benchmark or ablation is shown to isolate the accuracy of the variance prediction from the optimization process it influences, raising circularity concerns for the metacognitive signal.
- [Experiments] Performance superiority and convergence acceleration claims (up to 67%) are stated without reported baselines, number of runs, error bars, statistical significance tests, or ablation studies on the ICL prediction component, making it impossible to verify whether the data support the gains over external heuristics.
minor comments (2)
- [Abstract] The METIS acronym expansion is given but could be introduced more explicitly at first use for clarity.
- [§3] Notation: The description of in-context examples from 'recent training outcomes' would benefit from a precise definition or pseudocode in the methods to avoid ambiguity in implementation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We agree that additional supporting analyses and experimental details will strengthen the manuscript and will incorporate revisions accordingly.
Point-by-point responses
Referee: [Abstract and §3] The foundational claim that within-prompt reward variance effectively gauges prompt informativeness is asserted without preliminary correlation analysis, comparison to alternatives (e.g., reward magnitude or gradient norm), or validation against actual learning progress, leaving the curriculum decision mechanism ungrounded.
Authors: We agree that the manuscript would benefit from explicit preliminary validation of within-prompt reward variance as a proxy for prompt informativeness. While the paper presents this as a critical observation motivated by RFT dynamics, we did not include dedicated correlation studies or comparisons in the initial version. In the revised manuscript, we will add a subsection to §3 containing correlation analysis between reward variance and learning progress metrics (e.g., policy improvement on held-out prompts), along with direct comparisons to alternatives such as average reward magnitude and gradient norms. These will include quantitative coefficients and visualizations to better ground the curriculum decision mechanism. revision: yes
Referee: [§4, joint optimization] The closed-loop design jointly optimizes standard RFT rewards with the self-judgment reward, but no independent external benchmark or ablation is shown to isolate the accuracy of the variance prediction from the optimization process it influences, raising circularity concerns for the metacognitive signal.
Authors: The potential for circularity in the joint optimization is a fair concern. The variance prediction uses in-context learning from prior training outcomes, which are generated before the current optimization step, providing temporal separation. Nevertheless, to isolate the prediction accuracy, the revised version will include a dedicated ablation in §4 and the Experiments section. This will evaluate the ICL variance predictor independently on held-out prompts, comparing predictions against ground-truth informativeness derived from external metrics or oracle learning progress, decoupled from the joint training objective. We expect this to confirm that the metacognitive signal contributes meaningfully beyond optimization artifacts. revision: yes
Referee: [Experiments] Performance superiority and convergence acceleration claims (up to 67%) are stated without reported baselines, number of runs, error bars, statistical significance tests, or ablation studies on the ICL prediction component, making it impossible to verify whether the data support the gains over external heuristics.
Authors: We acknowledge that the experimental reporting was incomplete in the submitted manuscript. The experiments compare against standard RFT and external heuristic baselines as described in §5, but omitted run counts, variability measures, and statistical tests. In the revision, we will specify the number of independent runs, add error bars to all tables and convergence plots, and include statistical significance tests (e.g., paired t-tests) to support the reported performance gains and convergence acceleration. We will also add ablations isolating the ICL prediction component to demonstrate its contribution relative to the full METIS framework. revision: yes
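The validation the responses promise could begin with a simple sanity check: correlate observed within-prompt reward variance with a measured learning-progress signal. Everything below, data included, is hypothetical; a real analysis would use the paper's held-out prompts rather than these made-up numbers.

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation; enough for a first sanity check."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical measurements: per-prompt reward variance at selection
# time vs. held-out policy improvement after training on that prompt.
reward_variance   = [0.00, 0.05, 0.12, 0.19, 0.25]
learning_progress = [0.01, 0.03, 0.08, 0.11, 0.15]

r = pearson(reward_variance, learning_progress)  # strongly positive here
```

A strong positive correlation on real data would ground the variance proxy; a weak one would support the referee's objection that the curriculum mechanism is built on an unvalidated signal.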
Circularity Check
No significant circularity detected
full rationale
The paper's core method rests on an empirical observation that within-prompt reward variance gauges informativeness, followed by ICL-based prediction of that metric from recent outcomes and joint optimization of RFT rewards plus a self-judgment reward. No equations, derivations, or self-citations are provided that reduce any claimed prediction or result to its inputs by construction. The closed-loop design is presented as a deliberate architectural choice for metacognition rather than a tautological equivalence, and performance claims are grounded in external benchmarks across tasks rather than internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Within-prompt reward variance effectively gauges prompt informativeness
invented entities (1)
- METIS framework (no independent evidence)
Reference graph
Works this paper leans on
- [1] Lev S. Vygotsky. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, 1978.
- [2] John H. Flavell. Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34(10):906–911, 1979.
- [3] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [4] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [5] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [6] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... DAPO: An open-source LLM reinforcement learning system at scale, 2025.
- [7] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
- [8] Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang. VCRL: Variance-based curriculum reinforcement learning for large language models, 2025. URL https://arxiv.org/abs/2509.19803.
- [9] Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan Zhang. Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay, 2026. URL https://arxiv.org/abs/2506.05316.
- [10] Zhaolin Gao, Joongwon Kim, Wen Sun, Thorsten Joachims, Sid Wang, Richard Yuanzhe Pang, and Liang Tan. Prompt curriculum learning for efficient LLM post-training. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=zqOCacBD3P.
- [11] Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, and Wentian Zhao. DUMP: Automated distribution-level curriculum learning for RL-based LLM post-training. arXiv preprint arXiv:2504.09710, 2025.
- [12] Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for LLM reasoning. arXiv preprint arXiv:2505.14970, 2025.
- [13] Enci Zhang, Xingang Yan, Wei Lin, Tianxiang Zhang, and Lu Qianchun. Learning like humans: Advancing LLM reasoning capabilities via adaptive difficulty curriculum learning and expert-guided self-reformulation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6619–6633. Association for Computational Linguistics, 2025.
- [14] Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, and Shuiwang Ji. Curriculum reinforcement learning from easy to hard tasks improves LLM reasoning, 2026. URL https://arxiv.org/abs/2506.06632.
- [15] Bo Pang, Deqian Kong, Silvio Savarese, Caiming Xiong, and Yingbo Zhou. Reasoning curriculum: Bootstrapping broad LLM reasoning from math, 2025. URL https://arxiv.org/abs/2510.26143.
- [16] Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025.
- [17] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [18] Rui Zheng, Shihan Dou, Songyang Gao, Yuanhua Zhou, Weizhe Xu, Da Yin, Wei Shen, Quanquan Hu, Yijia Liu, Zhiheng Xi, Zhan Chen, Xiaoran Fan, Pan Cao, Siqi Huang, Yuhao Zhang, Xiangpeng Cui, Yixuan Cheng, Hang Zhao, Yuchen Yao, Hao Zhou, Caixia Xu, Zhengyan Li, Maosong He, Xuanjing Huang, Xipeng Qiu, and Tao Gui. Secrets of RLHF in large language models part I: PPO. arXiv preprint arXiv:2307.04964, 2023.
- [19] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740, 2024.
- [20] Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models. arXiv preprint arXiv:2310.10505, 2023.
- [21] Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. URL https://arxiv.org/abs/2501.03262.
- [22] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025.
- [23] Sanjiban Choudhury. Process reward models for LLM agents: Practical framework and directions. arXiv preprint arXiv:2502.10325, 2025.
- [24] Ruiqi Zhang, Daman Arora, Song Mei, and Andrea Zanette. SPEED-RL: Faster training of reasoning models via online curriculum learning. arXiv preprint arXiv:2506.09016, 2025.
- [25] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [26] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Xu Ma, Rui Li, Hao Xia, Jingjing Xu, Zhifang Wu, Baobao Chang, Xu Sun, Zhifang Li, and Zhifang Sui. A survey on in-context learning. arXiv preprint arXiv:2301.00234, 2023. URL https://arxiv.org/abs/2301.00234.
- [27] Shiguang Wu, Yaqing Wang, and Quanming Yao. Why in-context learning models are good few-shot learners? In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=iLUcsecZJp.
- [28] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114. Association for Computational Linguistics, 2022.
- [29] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064. Association for Computational Linguistics, 2022.
- [30] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671. Association for Computational Linguistics, 2022. doi: 10.18653/v1/2022.naacl-main.191. URL https://aclanthology.org/2022.naacl-main.191/.
- [32] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- [33] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words.
- [34]
- [35] Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, and Petar Velickovic. How do LLMs compute verbal confidence. arXiv preprint arXiv:2603.17839, 2026. URL https://arxiv.org/abs/2603.17839.
- [36] Chen Ling, Xujiang Zhao, Xuchao Zhang, Wei Cheng, Yanchi Liu, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Jie Ji, Guangji Bai, Liang Zhao, and Haifeng Chen. Uncertainty quantification for in-context learning of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024.
- [37] Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. arXiv preprint arXiv:2503.15850, 2025.
- [38] Felix Jedidja Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, and Owain Evans. Looking inward: Language models can learn about themselves by introspection. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=eb5pkwIB5i.
- [39] MiniMax. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025.
- [40] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783.
- [41] Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, and Balaji Lakshminarayanan. Self-evaluation improves selective generation in large language models, 2023. URL https://arxiv.org/abs/2312.09300.
- [42] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
- [43] Mathematical Association of America. American Invitational Mathematics Examination (AIME). https://maa.org/student-programs/amc/, 2024.
- [44] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. Advances in Neural Information Processing Systems, 34:7294–7306, 2021.
- [45] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
- [46] Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. CodeContests+: High-quality test case generation for competitive programming, 2025. URL https://arxiv.org/abs/2506.05817.
- [47] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- [48] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [49] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [50] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.
- [51] Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley Function Calling Leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.
- [52] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
- [53] Llama Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [54] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.
- [55] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention, 2023. URL https://arxiv.org/abs/2309.06180.