pith. machine review for the scientific record.

arxiv: 2605.11235 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: reinforcement fine-tuning · curriculum learning · LLM · self-judgment · reward variance · in-context learning · metacognition
0 comments

The pith

A language model can learn to judge which training prompts will help it most by predicting its own reward variance from recent examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the external rules or separate models used to pick training prompts in reinforcement fine-tuning of large language models can be replaced by an internal process. The model observes that the spread of rewards across the sampled responses to a single prompt signals how useful that prompt is, then treats its recent training history as in-context examples to forecast this spread for new prompts. It also receives a reward for making accurate forecasts, so the same optimization loop improves both task performance and the ability to choose what to learn next. This closed loop yields stronger results and up to 67 percent faster convergence on math, coding, and agent tasks without hand-designed curricula.
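To make the informativeness signal concrete, here is a minimal sketch of ranking candidate prompts by realized within-prompt reward variance, assuming binary per-rollout rewards and a hypothetical `rollout_fn` that samples and scores completions from the current policy; it illustrates the idea, not the paper's implementation.

```python
import numpy as np

def within_prompt_variance(rewards):
    """Spread of task rewards across the rollouts sampled for one prompt.
    For binary rewards this peaks at 0.25 when half the rollouts succeed,
    i.e. when the prompt is neither trivially easy nor hopelessly hard."""
    return float(np.asarray(rewards, dtype=float).var())

def rank_by_realized_variance(prompts, rollout_fn, n_rollouts=8):
    """Score each candidate by the realized variance of its rollout group
    and return candidates sorted most-informative first.
    `rollout_fn(prompt, n)` is a hypothetical stand-in that samples n
    completions from the current policy and scores them with the task reward."""
    scored = [(within_prompt_variance(rollout_fn(p, n_rollouts)), p) for p in prompts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored]
```

Ranking by realized variance requires rolling out every candidate first; the point of METIS is to predict this quantity before rollout so that only promising prompts are expanded.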

Core claim

METIS internalizes curriculum judgment as a native capability of the policy. It uses within-prompt reward variance as a gauge of prompt informativeness, predicts this variance from recent training outcomes treated as in-context examples, and jointly optimizes the standard RFT objective together with an additional self-judgment reward so the policy learns what to learn next.

What carries the argument

METIS self-judgment mechanism: the policy predicts within-prompt reward variance from recent training history as in-context examples and receives a joint reward for both task success and accurate judgment.
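A hedged sketch of that prediction step follows: recent prompt-variance pairs are serialized as few-shot examples and the policy itself is asked to emit a pre-rollout estimate v̂θ(x). The prompt template, memory size, and parsing below are assumptions for illustration, not the paper's verbatim format.

```python
from collections import deque

class CalibrationMemory:
    """Rolling buffer of (prompt, realized reward variance) pairs
    collected from recent training iterations."""
    def __init__(self, max_size=32):
        self.pairs = deque(maxlen=max_size)

    def add(self, prompt, realized_variance):
        self.pairs.append((prompt, realized_variance))

    def as_icl_prompt(self, candidate):
        """Format the memory as in-context examples and ask for a prediction."""
        lines = ["Given recent outcomes, predict the reward variance "
                 "(0.00-0.25) the current policy would obtain on the last prompt.", ""]
        for prompt, var in self.pairs:
            lines += [f"Prompt: {prompt}", f"Observed variance: {var:.2f}", ""]
        lines += [f"Prompt: {candidate}", "Observed variance:"]
        return "\n".join(lines)

def predict_variance(generate_fn, memory, candidate):
    """Query the policy for a numeric self-judgment; `generate_fn` is a
    hypothetical text-generation call to the same policy being trained."""
    text = generate_fn(memory.as_icl_prompt(candidate), max_new_tokens=4)
    try:
        return min(0.25, max(0.0, float(text.strip())))
    except ValueError:
        return 0.0  # conservative fallback when the prediction fails to parse
```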

If this is right

  • The policy reaches higher performance on mathematical reasoning, code generation, and agentic function-calling benchmarks.
  • Training converges up to 67 percent faster than methods that rely on handcrafted heuristics or auxiliary models.
  • Curriculum decisions become aligned directly with the policy's evolving training dynamics.
  • The model acquires an internal ability to decide what to learn next without external guidance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same variance-based self-assessment could be tested in other reward-driven training settings where example ordering matters.
  • Jointly training judgment and task performance may reduce the engineering effort needed to maintain separate curriculum modules.
  • If the approach scales, future models might handle their own data selection with less human-specified structure.

Load-bearing premise

Reward variance inside a prompt reliably indicates how informative that prompt is for training, and the policy can accurately predict this variance from its own recent outcomes while learning to judge itself at the same time.
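One way to see how the two halves of this premise are coupled is the sketch below: the realized variance both drives the curriculum decision and scores the policy's own prediction, and a small weight λ folds that self-judgment signal into the standard RFT objective. The squared-error form and the λ = 0.01 default are assumptions loosely guided by Figures 6 and 10, not the paper's exact loss.

```python
import numpy as np

def realized_variance(task_rewards):
    """Within-prompt reward variance over one rollout group
    (bounded above by 0.25 for binary rewards)."""
    return float(np.asarray(task_rewards, dtype=float).var())

def self_judgment_reward(predicted_var, realized_var):
    """Reward accurate pre-rollout predictions: 1 for an exact match,
    decreasing with squared error (a Brier-style score)."""
    return 1.0 - (predicted_var - realized_var) ** 2

def joint_loss(task_loss, judgment_loss, lam=0.01):
    """Joint objective: standard RFT task loss on solution tokens plus a
    lambda-weighted self-judgment loss on prediction tokens. Figure 10
    reports that lam = 1 lets the judgment term dominate and collapse
    training, while a small lam preserves cross-pool spread."""
    return task_loss + lam * judgment_loss
```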

What would settle it

Running the same benchmarks with METIS and with standard external curriculum methods, then finding no difference in final performance or convergence speed, would show the internalized judgment adds nothing.

Figures

Figures reproduced from arXiv: 2605.11235 by Bharathan Balaji, Cathy Wu, Han Zheng, Karthick Gunasekaran, Shiv Vitaladevuni, Yining Ma, Zheng Du.

Figure 1
Figure 1: Conceptual comparison. While existing curricula rely on external schedules, heuristics, or … view at source ↗
Figure 2
Figure 2: Overview of METIS. At each iteration, the policy predicts candidate informativeness v̂θ(x) via in-context learning on a calibration memory of recent prompt-variance pairs (left). The most informative prompts are then rolled out (middle) to yield task rewards and realized variance v(x). The policy is jointly optimized (right) via standard task loss on solution tokens and a self-judgment loss on prediction t… view at source ↗
Figure 3
Figure 3: Training dynamics of METIS. Top: downstream pass@1 versus wall-clock training time. Bottom: mean magnitude of the group-relative advantage |A| per training step. METIS reaches higher pass@1 earlier while sustaining a larger per-step learning signal than all baselines. [axes: training step vs. avg rollout reward; training time (hours) vs. MATH-500 pass@1] view at source ↗
Figure 4
Figure 4: The average rollout reward and the corresponding training performance (pass@1) curve, on a … view at source ↗
Figure 5
Figure 5: Compute overhead of curriculum methods, measured against No Curriculum. Left: wall-clock overhead. Right: per-step throughput drop. Lower is better on both axes. view at source ↗
Figure 6
Figure 6: Effect of the joint judgment loss Ljudge. Left: self-judgment reward Rjudge rises during training, indicating that the policy is learning to self-judge accurately. Middle, right: mean and std of v̂θ(x) over the candidate pool; removing Ljudge flattens the mean and shrinks the spread. view at source ↗
Figure 7
Figure 7: Validation pass@1 vs. wall-clock training time for Llama-3.1-8B-Instruct trained on DAPO … view at source ↗
Figure 8
Figure 8: Validation pass@1 vs. wall-clock training time for Qwen3-8B-Base trained on … view at source ↗
Figure 9
Figure 9: Validation pass@1 vs. wall-clock training time for DeepSeek-R1-Distill-Llama-8B trained … view at source ↗
Figure 10
Figure 10: Effect of the judgment loss weight λ on the policy's pre-rollout predictions v̂θ(x) over the candidate pool. Left: mean of v̂θ(x); Right: standard deviation across the pool. λ = 0: predictions stay stationary and undifferentiated. λ = 0.01: mean rises with the policy's competence and cross-pool variance is preserved. λ = 1: the loss dominates, training collapses, and v̂θ(x) saturates at 0.25. view at source ↗
read the original abstract

In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy's training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks from mathematical reasoning, code generation, to agentic function-calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces METIS (METacognitive Internalized Self-judgment), a framework for LLM reinforcement fine-tuning (RFT) that internalizes curriculum judgment. It uses within-prompt reward variance as a proxy for prompt informativeness, predicts this variance via in-context learning from recent training outcomes, and jointly optimizes the policy with both standard RFT rewards and a self-judgment reward to enable dynamic training allocation without external heuristics or auxiliary models. The work claims consistent superiority and up to 67% faster convergence across discrete and continuous benchmarks in mathematical reasoning, code generation, and agentic function-calling.

Significance. If the core claims hold after validation, the approach would be significant for RFT by providing a simple closed-loop alternative to handcrafted curricula, potentially improving alignment and efficiency. The metacognitive internalization idea is conceptually novel and could reduce reliance on external components, but the absence of supporting analysis for the variance proxy and experimental details currently limits its assessed impact.

major comments (3)
  1. [Abstract and §3] The foundational claim that within-prompt reward variance effectively gauges prompt informativeness is asserted without preliminary correlation analysis, comparison to alternatives (e.g., reward magnitude or gradient norm), or validation against actual learning progress, leaving the curriculum decision mechanism ungrounded.
  2. [§4, joint optimization] The closed-loop design jointly optimizes standard RFT rewards with the self-judgment reward, but no independent external benchmark or ablation is shown to isolate the accuracy of the variance prediction from the optimization process it influences, raising circularity concerns for the metacognitive signal.
  3. [Experiments] Performance superiority and convergence acceleration claims (up to 67%) are stated without reported baselines, number of runs, error bars, statistical significance tests, or ablation studies on the ICL prediction component, making it impossible to verify whether the data support the gains over external heuristics.
minor comments (2)
  1. [Abstract] The METIS acronym expansion is given but could be introduced more explicitly at first use for clarity.
  2. [§3] Notation: The description of in-context examples from 'recent training outcomes' would benefit from a precise definition or pseudocode in the methods to avoid ambiguity in implementation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We agree that additional supporting analyses and experimental details will strengthen the manuscript and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] The foundational claim that within-prompt reward variance effectively gauges prompt informativeness is asserted without preliminary correlation analysis, comparison to alternatives (e.g., reward magnitude or gradient norm), or validation against actual learning progress, leaving the curriculum decision mechanism ungrounded.

    Authors: We agree that the manuscript would benefit from explicit preliminary validation of within-prompt reward variance as a proxy for prompt informativeness. While the paper presents this as a critical observation motivated by RFT dynamics, we did not include dedicated correlation studies or comparisons in the initial version. In the revised manuscript, we will add a subsection to §3 containing correlation analysis between reward variance and learning progress metrics (e.g., policy improvement on held-out prompts), along with direct comparisons to alternatives such as average reward magnitude and gradient norms. These will include quantitative coefficients and visualizations to better ground the curriculum decision mechanism. revision: yes

  2. Referee: [§4, joint optimization] The closed-loop design jointly optimizes standard RFT rewards with the self-judgment reward, but no independent external benchmark or ablation is shown to isolate the accuracy of the variance prediction from the optimization process it influences, raising circularity concerns for the metacognitive signal.

    Authors: The potential for circularity in the joint optimization is a fair concern. The variance prediction uses in-context learning from prior training outcomes, which are generated before the current optimization step, providing temporal separation. Nevertheless, to isolate the prediction accuracy, the revised version will include a dedicated ablation in §4 and the Experiments section. This will evaluate the ICL variance predictor independently on held-out prompts, comparing predictions against ground-truth informativeness derived from external metrics or oracle learning progress, decoupled from the joint training objective. We expect this to confirm that the metacognitive signal contributes meaningfully beyond optimization artifacts (a minimal sketch of such a decoupled check follows these responses). revision: yes

  3. Referee: [Experiments] Performance superiority and convergence acceleration claims (up to 67%) are stated without reported baselines, number of runs, error bars, statistical significance tests, or ablation studies on the ICL prediction component, making it impossible to verify whether the data support the gains over external heuristics.

    Authors: We acknowledge that the experimental reporting was incomplete in the submitted manuscript. The experiments compare against standard RFT and external heuristic baselines as described in §5, but omitted run counts, variability measures, and statistical tests. In the revision, we will specify the number of independent runs, add error bars to all tables and convergence plots, and include statistical significance tests (e.g., paired t-tests) to support the reported performance gains and convergence acceleration. We will also add ablations isolating the ICL prediction component to demonstrate its contribution relative to the full METIS framework. revision: yes
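As flagged in the second response above, a decoupled check of the predictor could be run entirely outside the training loop; the sketch below is one possible instantiation, using Spearman rank correlation between pre-rollout predictions and realized variance on held-out prompts. The statistic, the held-out interface, and the function names are assumptions, not something the paper reports.

```python
import numpy as np
from scipy.stats import spearmanr

def validate_variance_predictor(heldout_prompts, predict_fn, rollout_fn, n_rollouts=8):
    """Compare pre-rollout variance predictions against realized reward
    variance on prompts that never enter the joint training objective.
    `predict_fn(prompt)` and `rollout_fn(prompt, n)` are hypothetical
    stand-ins for the ICL predictor and policy rollout plus task reward."""
    predicted, realized = [], []
    for prompt in heldout_prompts:
        predicted.append(predict_fn(prompt))
        rewards = np.asarray(rollout_fn(prompt, n_rollouts), dtype=float)
        realized.append(float(rewards.var()))
    rho, p_value = spearmanr(predicted, realized)
    return {"spearman_rho": float(rho), "p_value": float(p_value)}
```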

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core method rests on an empirical observation that within-prompt reward variance gauges informativeness, followed by ICL-based prediction of that metric from recent outcomes and joint optimization of RFT rewards plus a self-judgment reward. No equations, derivations, or self-citations are provided that reduce any claimed prediction or result to its inputs by construction. The closed-loop design is presented as a deliberate architectural choice for metacognition rather than a tautological equivalence, and performance claims are grounded in external benchmarks across tasks rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger populated from abstract only; full paper may contain additional fitted parameters or unstated assumptions.

axioms (1)
  • domain assumption Within-prompt reward variance effectively gauges prompt informativeness
    Presented as the critical observation that enables the self-judgment mechanism.
invented entities (1)
  • METIS framework no independent evidence
    purpose: Internalize curriculum judgment as a native policy capability
    Newly introduced method whose independent validation is not provided in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 1311 out tokens · 76916 ms · 2026-05-13T02:14:21.889526+00:00 · methodology

discussion (0)


    Sum of digits of 1000: 1 + 0 + 0 + 0 = 1 Adding these together, we get: 45 + 900 + 13500 + 1 = 14446 Now, we need to find the remainder when 14446 is divided by 9. We can do this by summing the digits of 14446: 1 + 4 + 4 + 4 + 6 = 19 1 + 9 = 10 1 + 0 = 1 So, the remainder when 14446 is divided by 9 is 1. Therefore, the remainder when the 1000-digit number...