pith. machine review for the scientific record.

arxiv: 2605.11235 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: reinforcement fine-tuning · curriculum learning · LLM · self-judgment · reward variance · in-context learning · metacognition
0 comments

The pith

A language model can learn to judge which training prompts will help it most by predicting its own reward variance from recent examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the external rules or separate models used to pick training prompts in reinforcement fine-tuning of large language models can be replaced by an internal process. The model observes that the spread of rewards across the sampled responses to a single prompt signals how useful that prompt is, then treats its recent training history as in-context examples to forecast this spread for new prompts. It also receives a reward for making accurate forecasts, so the same optimization loop improves both task performance and the ability to choose what to learn next. This closed loop yields stronger results and up to 67 percent faster convergence on math, coding, and agent tasks without hand-designed curricula.
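To make the informativeness signal concrete, here is a minimal sketch of ranking candidate prompts by realized within-prompt reward variance, assuming binary per-rollout rewards and a hypothetical `rollout_fn` that samples and scores completions from the current policy; it illustrates the idea, not the paper's implementation.

```python
import numpy as np

def within_prompt_variance(rewards):
    """Spread of task rewards across the rollouts sampled for one prompt.
    For binary rewards this peaks at 0.25 when half the rollouts succeed,
    i.e. when the prompt is neither trivially easy nor hopelessly hard."""
    return float(np.asarray(rewards, dtype=float).var())

def rank_by_realized_variance(prompts, rollout_fn, n_rollouts=8):
    """Score each candidate by the realized variance of its rollout group
    and return candidates sorted most-informative first.
    `rollout_fn(prompt, n)` is a hypothetical stand-in that samples n
    completions from the current policy and scores them with the task reward."""
    scored = [(within_prompt_variance(rollout_fn(p, n_rollouts)), p) for p in prompts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored]
```

Ranking by realized variance requires rolling out every candidate first; the point of METIS is to predict this quantity before rollout so that only promising prompts are expanded.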

Core claim

METIS internalizes curriculum judgment as a native capability of the policy. It uses within-prompt reward variance as a gauge of prompt informativeness, predicts this variance from recent training outcomes treated as in-context examples, and jointly optimizes the standard RFT objective together with an additional self-judgment reward so the policy learns what to learn next.

What carries the argument

METIS self-judgment mechanism: the policy predicts within-prompt reward variance from recent training history as in-context examples and receives a joint reward for both task success and accurate judgment.
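A hedged sketch of that prediction step follows: recent prompt-variance pairs are serialized as few-shot examples and the policy itself is asked to emit a pre-rollout estimate v̂θ(x). The prompt template, memory size, and parsing below are assumptions for illustration, not the paper's verbatim format.

```python
from collections import deque

class CalibrationMemory:
    """Rolling buffer of (prompt, realized reward variance) pairs
    collected from recent training iterations."""
    def __init__(self, max_size=32):
        self.pairs = deque(maxlen=max_size)

    def add(self, prompt, realized_variance):
        self.pairs.append((prompt, realized_variance))

    def as_icl_prompt(self, candidate):
        """Format the memory as in-context examples and ask for a prediction."""
        lines = ["Given recent outcomes, predict the reward variance "
                 "(0.00-0.25) the current policy would obtain on the last prompt.", ""]
        for prompt, var in self.pairs:
            lines += [f"Prompt: {prompt}", f"Observed variance: {var:.2f}", ""]
        lines += [f"Prompt: {candidate}", "Observed variance:"]
        return "\n".join(lines)

def predict_variance(generate_fn, memory, candidate):
    """Query the policy for a numeric self-judgment; `generate_fn` is a
    hypothetical text-generation call to the same policy being trained."""
    text = generate_fn(memory.as_icl_prompt(candidate), max_new_tokens=4)
    try:
        return min(0.25, max(0.0, float(text.strip())))
    except ValueError:
        return 0.0  # conservative fallback when the prediction fails to parse
```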

If this is right

  • The policy reaches higher performance on mathematical reasoning, code generation, and agentic function-calling benchmarks.
  • Training converges up to 67 percent faster than methods that rely on handcrafted heuristics or auxiliary models.
  • Curriculum decisions become aligned directly with the policy's evolving training dynamics.
  • The model acquires an internal ability to decide what to learn next without external guidance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same variance-based self-assessment could be tested in other reward-driven training settings where example ordering matters.
  • Jointly training judgment and task performance may reduce the engineering effort needed to maintain separate curriculum modules.
  • If the approach scales, future models might handle their own data selection with less human-specified structure.

Load-bearing premise

Reward variance inside a prompt reliably indicates how informative that prompt is for training, and the policy can accurately predict this variance from its own recent outcomes while learning to judge itself at the same time.
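One way to see how the two halves of this premise are coupled is the sketch below: the realized variance both drives the curriculum decision and scores the policy's own prediction, and a small weight λ folds that self-judgment signal into the standard RFT objective. The squared-error form and the λ = 0.01 default are assumptions loosely guided by Figures 6 and 10, not the paper's exact loss.

```python
import numpy as np

def realized_variance(task_rewards):
    """Within-prompt reward variance over one rollout group
    (bounded above by 0.25 for binary rewards)."""
    return float(np.asarray(task_rewards, dtype=float).var())

def self_judgment_reward(predicted_var, realized_var):
    """Reward accurate pre-rollout predictions: 1 for an exact match,
    decreasing with squared error (a Brier-style score)."""
    return 1.0 - (predicted_var - realized_var) ** 2

def joint_loss(task_loss, judgment_loss, lam=0.01):
    """Joint objective: standard RFT task loss on solution tokens plus a
    lambda-weighted self-judgment loss on prediction tokens. Figure 10
    reports that lam = 1 lets the judgment term dominate and collapse
    training, while a small lam preserves cross-pool spread."""
    return task_loss + lam * judgment_loss
```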

What would settle it

Running the same benchmarks with METIS and with standard external curriculum methods, then finding no difference in final performance or convergence speed, would show the internalized judgment adds nothing.

Figures

Figures reproduced from arXiv: 2605.11235 by Bharathan Balaji, Cathy Wu, Han Zheng, Karthick Gunasekaran, Shiv Vitaladevuni, Yining Ma, Zheng Du.

Figure 1
Figure 1: Conceptual comparison. While existing curricula rely on external schedules, heuristics, or … view at source ↗
Figure 2
Figure 2: Overview of METIS. At each iteration, the policy predicts candidate informativeness v̂θ(x) via in-context learning on a calibration memory of recent prompt-variance pairs (left). The most informative prompts are then rolled out (middle) to yield task rewards and realized variance v(x). The policy is jointly optimized (right) via standard task loss on solution tokens and a self-judgment loss on prediction t… view at source ↗
Figure 3
Figure 3: Training dynamics of METIS. Top: downstream pass@1 versus wall-clock training time. Bottom: mean magnitude of the group-relative advantage |A| per training step. METIS reaches higher pass@1 earlier while sustaining a larger per-step learning signal than all baselines. [axes: training step vs. avg rollout reward; training time (hours) vs. MATH-500 pass@1] view at source ↗
Figure 4
Figure 4: The average rollout reward and the corresponding training performance (pass@1) curve, on a … view at source ↗
Figure 5
Figure 5: Compute overhead of curriculum methods, measured against No Curriculum. Left: wall-clock overhead. Right: per-step throughput drop. Lower is better on both axes. view at source ↗
Figure 6
Figure 6: Effect of the joint judgment loss Ljudge. Left: self-judgment reward Rjudge rises during training, indicating that the policy is learning to self-judge accurately. Middle, right: mean and std of v̂θ(x) over the candidate pool; removing Ljudge flattens the mean and shrinks the spread. view at source ↗
Figure 7
Figure 7: Validation pass@1 vs. wall-clock training time for Llama-3.1-8B-Instruct trained on DAPO … view at source ↗
Figure 8
Figure 8: Validation pass@1 vs. wall-clock training time for Qwen3-8B-Base trained on … view at source ↗
Figure 9
Figure 9: Validation pass@1 vs. wall-clock training time for DeepSeek-R1-Distill-Llama-8B trained … view at source ↗
Figure 10
Figure 10: Effect of the judgment loss weight λ on the policy's pre-rollout predictions v̂θ(x) over the candidate pool. Left: mean of v̂θ(x); Right: standard deviation across the pool. λ = 0: predictions stay stationary and undifferentiated. λ = 0.01: mean rises with the policy's competence and cross-pool variance is preserved. λ = 1: the loss dominates, training collapses, and v̂θ(x) saturates at 0.25. view at source ↗
read the original abstract

In LLM Reinforcement Fine-Tuning (RFT), curriculum learning drives both efficiency and performance. Yet, current methods externalize curriculum judgment via handcrafted heuristics or auxiliary models, risking misalignment with the policy's training dynamics. In this paper, we introduce METIS (METacognitive Internalized Self-judgment), a novel framework that internalizes curriculum judgment as a native capability. Leveraging a critical observation that within-prompt reward variance effectively gauges prompt informativeness, METIS predicts this metric based on recent training outcomes as lightweight in-context learning examples. This intrinsic self-judgment then dynamically dictates the training allocation. Moreover, METIS closes the loop between judgment and optimization by jointly optimizing the standard RFT rewards and a self-judgment reward. This allows the policy to learn what to learn next, as a form of metacognition. Across extensive discrete and continuous RFT benchmarks from mathematical reasoning, code generation, to agentic function-calling, METIS consistently delivers superior performance while accelerating convergence by up to 67%. By bypassing handcrafted heuristics and auxiliary models, our work establishes a simple, closed-loop, and highly efficient curriculum internalization paradigm for LLM reinforcement fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces METIS (METacognitive Internalized Self-judgment), a framework for LLM reinforcement fine-tuning (RFT) that internalizes curriculum judgment. It uses within-prompt reward variance as a proxy for prompt informativeness, predicts this variance via in-context learning from recent training outcomes, and jointly optimizes the policy with both standard RFT rewards and a self-judgment reward to enable dynamic training allocation without external heuristics or auxiliary models. The work claims consistent superiority and up to 67% faster convergence across discrete and continuous benchmarks in mathematical reasoning, code generation, and agentic function-calling.

Significance. If the core claims hold after validation, the approach would be significant for RFT by providing a simple closed-loop alternative to handcrafted curricula, potentially improving alignment and efficiency. The metacognitive internalization idea is conceptually novel and could reduce reliance on external components, but the absence of supporting analysis for the variance proxy and experimental details currently limits its assessed impact.

major comments (3)
  1. [Abstract and §3] The foundational claim that within-prompt reward variance effectively gauges prompt informativeness is asserted without preliminary correlation analysis, comparison to alternatives (e.g., reward magnitude or gradient norm), or validation against actual learning progress, leaving the curriculum decision mechanism ungrounded.
  2. [§4, joint optimization] The closed-loop design jointly optimizes standard RFT rewards with the self-judgment reward, but no independent external benchmark or ablation is shown to isolate the accuracy of the variance prediction from the optimization process it influences, raising circularity concerns for the metacognitive signal.
  3. [Experiments] Performance superiority and convergence acceleration claims (up to 67%) are stated without reported baselines, number of runs, error bars, statistical significance tests, or ablation studies on the ICL prediction component, making it impossible to verify whether the data support the gains over external heuristics.
minor comments (2)
  1. [Abstract] The METIS acronym expansion is given but could be introduced more explicitly at first use for clarity.
  2. [§3] Notation: The description of in-context examples from 'recent training outcomes' would benefit from a precise definition or pseudocode in the methods to avoid ambiguity in implementation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We agree that additional supporting analyses and experimental details will strengthen the manuscript and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] The foundational claim that within-prompt reward variance effectively gauges prompt informativeness is asserted without preliminary correlation analysis, comparison to alternatives (e.g., reward magnitude or gradient norm), or validation against actual learning progress, leaving the curriculum decision mechanism ungrounded.

    Authors: We agree that the manuscript would benefit from explicit preliminary validation of within-prompt reward variance as a proxy for prompt informativeness. While the paper presents this as a critical observation motivated by RFT dynamics, we did not include dedicated correlation studies or comparisons in the initial version. In the revised manuscript, we will add a subsection to §3 containing correlation analysis between reward variance and learning progress metrics (e.g., policy improvement on held-out prompts), along with direct comparisons to alternatives such as average reward magnitude and gradient norms. These will include quantitative coefficients and visualizations to better ground the curriculum decision mechanism. revision: yes

  2. Referee: [§4, joint optimization] The closed-loop design jointly optimizes standard RFT rewards with the self-judgment reward, but no independent external benchmark or ablation is shown to isolate the accuracy of the variance prediction from the optimization process it influences, raising circularity concerns for the metacognitive signal.

    Authors: The potential for circularity in the joint optimization is a fair concern. The variance prediction uses in-context learning from prior training outcomes, which are generated before the current optimization step, providing temporal separation. Nevertheless, to isolate the prediction accuracy, the revised version will include a dedicated ablation in §4 and the Experiments section. This will evaluate the ICL variance predictor independently on held-out prompts, comparing predictions against ground-truth informativeness derived from external metrics or oracle learning progress, decoupled from the joint training objective. We expect this to confirm that the metacognitive signal contributes meaningfully beyond optimization artifacts (a minimal sketch of such a decoupled check follows these responses). revision: yes

  3. Referee: [Experiments] Performance superiority and convergence acceleration claims (up to 67%) are stated without reported baselines, number of runs, error bars, statistical significance tests, or ablation studies on the ICL prediction component, making it impossible to verify whether the data support the gains over external heuristics.

    Authors: We acknowledge that the experimental reporting was incomplete in the submitted manuscript. The experiments compare against standard RFT and external heuristic baselines as described in §5, but omitted run counts, variability measures, and statistical tests. In the revision, we will specify the number of independent runs, add error bars to all tables and convergence plots, and include statistical significance tests (e.g., paired t-tests) to support the reported performance gains and convergence acceleration. We will also add ablations isolating the ICL prediction component to demonstrate its contribution relative to the full METIS framework. revision: yes
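As flagged in the second response above, a decoupled check of the predictor could be run entirely outside the training loop; the sketch below is one possible instantiation, using Spearman rank correlation between pre-rollout predictions and realized variance on held-out prompts. The statistic, the held-out interface, and the function names are assumptions, not something the paper reports.

```python
import numpy as np
from scipy.stats import spearmanr

def validate_variance_predictor(heldout_prompts, predict_fn, rollout_fn, n_rollouts=8):
    """Compare pre-rollout variance predictions against realized reward
    variance on prompts that never enter the joint training objective.
    `predict_fn(prompt)` and `rollout_fn(prompt, n)` are hypothetical
    stand-ins for the ICL predictor and policy rollout plus task reward."""
    predicted, realized = [], []
    for prompt in heldout_prompts:
        predicted.append(predict_fn(prompt))
        rewards = np.asarray(rollout_fn(prompt, n_rollouts), dtype=float)
        realized.append(float(rewards.var()))
    rho, p_value = spearmanr(predicted, realized)
    return {"spearman_rho": float(rho), "p_value": float(p_value)}
```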

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core method rests on an empirical observation that within-prompt reward variance gauges informativeness, followed by ICL-based prediction of that metric from recent outcomes and joint optimization of RFT rewards plus a self-judgment reward. No equations, derivations, or self-citations are provided that reduce any claimed prediction or result to its inputs by construction. The closed-loop design is presented as a deliberate architectural choice for metacognition rather than a tautological equivalence, and performance claims are grounded in external benchmarks across tasks rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Ledger populated from abstract only; full paper may contain additional fitted parameters or unstated assumptions.

axioms (1)
  • domain assumption Within-prompt reward variance effectively gauges prompt informativeness
    Presented as the critical observation that enables the self-judgment mechanism.
invented entities (1)
  • METIS framework no independent evidence
    purpose: Internalize curriculum judgment as a native policy capability
    Newly introduced method whose independent validation is not provided in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 1311 out tokens · 76916 ms · 2026-05-13T02:14:21.889526+00:00 · methodology

discussion (0)


    Sum of digits of 1000: 1 + 0 + 0 + 0 = 1 Adding these together, we get: 45 + 900 + 13500 + 1 = 14446 Now, we need to find the remainder when 14446 is divided by 9. We can do this by summing the digits of 14446: 1 + 4 + 4 + 4 + 6 = 19 1 + 9 = 10 1 + 0 = 1 So, the remainder when 14446 is divided by 9 is 1. Therefore, the remainder when the 1000-digit number...