Hide to Guide: Learning via Semantic Masking

Alex Hu; Han Cai; Luke J. Huang; Qinghao Hu; Ruitao Liu; Shang Yang; Song Han; Yecheng Wu; Zhuoyang Zhang

arxiv: 2605.25198 · v1 · pith:ELYC5DAXnew · submitted 2026-05-24 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-30 12:13 UTC · model grok-4.3


keywords reinforcement learning · semantic masking · expert guidance · reward hacking prevention · RLVR · policy optimization · reasoning tasks


The pith

Masking reward-relevant semantic spans in expert traces prevents reward hacking in RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Semantic Masked Expert Policy Optimization (SMEPO) to address limited exploration in reinforcement learning with verifiable rewards. By masking parts of expert traces that contain reward-relevant content like answers or code, while keeping the structure, the method turns learning into a fill-in-the-blank task. This prevents the policy from simply copying the trace to get reward. The approach improves accuracy and speeds up training across math, code, and search tasks without changing the reward or objective.

Core claim

SMEPO masks reward-relevant semantic spans along the critical path to the verifier target while preserving the expert's decomposition, plan, and procedural structure, turning hard problems into a fill-in-the-blank process where the policy follows the route but reconstructs missing content itself.
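As a rough illustration of the fill-in-the-blank idea, the sketch below hides fenced code, `\boxed{...}` answers, and numeric literals in a math-style trace while leaving the surrounding prose in place. The placeholder `[MASK]`, the three regex rules, and the function name `mask_trace` are assumptions made for this example; the paper's actual domain-specific masking rules are not given on this page and are likely more selective.

```python
import re

MASK = "[MASK]"  # hypothetical placeholder; the paper's actual token may differ
FENCE = "`" * 3  # Markdown code-fence delimiter

# Illustrative rules only (assumptions, not the paper's detectors).
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
BOXED = re.compile(r"\\boxed\{[^{}]*\}")
NUMBER = re.compile(r"(?<![A-Za-z_\d])\d+(?:\.\d+)?")


def mask_trace(trace: str) -> str:
    """Replace reward-relevant spans with MASK, keeping the prose structure."""
    masked = CODE_BLOCK.sub(MASK, trace)  # hide whole fenced programs
    masked = BOXED.sub(lambda m: "\\boxed{" + MASK + "}", masked)  # hide final answers
    masked = NUMBER.sub(MASK, masked)  # hide remaining numeric values
    return masked


trace = "Add the legs: 3 + 4 = 7. Double it: 7 * 2 = 14. Answer: \\boxed{14}."
print(mask_trace(trace))
```

Note that this naive rule also hides values given in the problem statement and would hide step numbers; a practical detector would need to distinguish those from values on the path to the verifier target.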

What carries the argument

Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy that hides reward-relevant content in expert traces for guided RLVR.

If this is right

Accuracy improves by up to 3.2 points over GRPO on diverse domains like math, code, and agentic search.
Training time reduces by up to 4.2x compared to baselines.
Requires no changes to the reward function or RL objective.
The policy learns underlying reasoning and agentic behavior rather than copying traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar masking could be applied to other forms of guidance in machine learning beyond RL.
Testing on a wider range of tasks might reveal if the method scales to more complex agent behaviors.
The approach might reduce the need for large amounts of expert data by making each trace more effective for learning.

Load-bearing premise

Selectively masking reward-relevant semantic spans still allows the policy to learn the underlying reasoning and behavior instead of reconstructing the masked content through other means.

What would settle it

A test where models trained with SMEPO achieve high rewards but fail to solve similar problems when expert traces are not provided or when masks are removed.

Figures

Figures reproduced from arXiv: 2605.25198 by Alex Hu, Han Cai, Luke J. Huang, Qinghao Hu, Ruitao Liu, Shang Yang, Song Han, Yecheng Wu, Zhuoyang Zhang.

**Figure 1.** Overview of SMEPO. On hard reasoning problems, (a) vanilla RL often receives sparse rewards because the initial policy fails to discover useful solution trajectories. (b) Full expert traces can provide useful trajectories, but may also reveal verifier-aligned content along the solution path, such as intermediate values, final answers, code blocks, or answer-related entities. RL conditioned on the full trace …

**Figure 2.** Training pipeline for SMEPO. We detect reward-relevant spans in the expert trace with domain-specific masking rules, replace those spans with mask placeholders, and then run the same GRPO loop conditioned on the original problem and the masked trace.

**Figure 3.** Training dynamics and target-reaching efficiency. Left: LiveCodeBench curve for Qwen3-8B-Base with DeepSeek-R1 traces. Right: time to reach the GRPO peak mean@32 accuracy (averaged over AIME25 and AIME26). SMEPO reaches 50.3 with both expert sources, improving over vanilla GRPO by 2.7 points and over full-expert-trace conditioning by 9.3–9.5 points. The same trend holds across other models. On Qwen2.5-…

**Figure 4.** Code-copy diagnostics. Left: Visible-trace overlap captures copying from the expert content actually visible in the prompt. Right: Expert-code similarity captures reuse of the original expert program. The left figure does not include GRPO since it does not involve expert traces.

| Method | GSM8K | MATH500 | AIME25 | AIME26 | AMC | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen3-8B-Base | 68.0 | 67.0 | 9.9 | 7.6 | 50.0 | 25.8 | 38.1 |
| GHPO | 78.3 | 72.8 | 13.4 | 10.2 | 60.0… | | |
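One simple way to approximate a copy diagnostic of this kind is token n-gram overlap: the fraction of n-grams in the generated program that also occur in a reference text (the visible masked trace, or the original expert program). The page does not state how the paper computes its overlap or similarity scores, so the whitespace tokenization, the choice of `n = 3`, and the helper names below are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copy_overlap(generated: str, reference: str, n: int = 3) -> float:
    """Fraction of the generated text's n-grams that also appear in the reference."""
    gen = ngrams(generated.split(), n)
    ref = ngrams(reference.split(), n)
    if not gen:  # generated text shorter than n tokens
        return 0.0
    return len(gen & ref) / len(gen)


expert_code = "def add(a, b): return a + b"
policy_code = "def add(a, b): return a + b"
masked_trace = "Define a helper that sums its inputs: [MASK]"

print(copy_overlap(policy_code, expert_code))   # similarity to the original expert program
print(copy_overlap(policy_code, masked_trace))  # overlap with what the prompt actually showed
```

Under this sketch, a high score against the expert program combined with a low score against the visible trace would suggest the policy reconstructed the code rather than copied it from the prompt.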

**Figure 5.** Training curves with GHPO. Using the original GHPO setup, adding SMEPO improves the model accuracy on GSM8K and MATH500.

| Cat. / Method | GSM8K | MATH500 | AIME25 | AIME26 | AMC | Olympiad | Avg. |
|---|---|---|---|---|---|---|---|
| Full Expert (100%) | 76.4 | 69.2 | 11.4 | 8.6 | 52.5 | 27.7 | 41.0 |
| Prefix Expert (75%) | 86.1 | 75.0 | 14.5 | 11.6 | 60.0 | 29.7 | 46.1 |
| Expert (50%) | 85.2 | 72.6 | 13.5 | 10.3 | 57.5 | 28.8 | 44.7 |
| Expert (25%) | 87.1 | 74.6 | 14.2 | 11.0 | 65.0 | 30.6 | 47.1 |
| Random Word | 73.6 | 67.0 | 10.2… | | | | |

**Figure 6.** Additional LiveCodeBench v6 subset evaluation. Left: training dynamics; right: final pass@1 scores. We evaluate Qwen3-8B-Base on 131 problems released between 2025-02-01 and 2025-05-01. SMEPO performs best, consistent with the main code evaluation.

A Additional Results

A.1 Additional LiveCodeBench v6 subset evaluation

We further evaluate Qwen3-8B-Base on a commonly used LiveCodeBench v6 subset, consisting …

**Figure 7.** Effect of partial expert-trace exposure on AIME26. Coarse prefix exposure reduces the harm of full-expert-trace conditioning, but its effect is non-monotonic and remains below SMEPO, suggesting that the key issue is not trace length alone.

B.1 Code-copy diagnostics. For code tasks, the reward-relevant content is usually a complete executable solution rather than a short numeric answer or entity. We therefo…

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward-relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer-related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided-RL methods reduce this risk by using partial trajectories, but they mainly control how much expert information is shown heuristically rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy for expert-guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward-relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill-in-the-blank process: the policy can follow the expert's problem-solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at https://github.com/mit-han-lab/SMEPO.
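The exploration problem described in the abstract can be made concrete with GRPO's group-relative advantage, which normalizes each sampled response's reward by the mean and standard deviation of its group. The sketch below is a minimal version under stated assumptions: binary verifier rewards, a population standard deviation, and a small `eps` for numerical stability; the function name `group_advantages` is hypothetical. When every sample in a group fails, all advantages are zero and the group contributes no policy-gradient signal.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard problem, no guidance: every rollout fails, so every advantage is zero.
print(group_advantages([0, 0, 0, 0]))

# With guidance, one rollout succeeds and the group now carries a learning signal.
print(group_advantages([0, 0, 0, 1]))
```

This is the gap expert guidance aims to close: raising the chance that at least some rollouts in a group succeed, without handing the policy the rewarded content directly.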

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SMEPO adds a semantic masking layer on top of expert traces in RLVR to cut down on direct copying, but the reported gains are small and the evidence that it actually builds reasoning instead of pattern completion is still thin.


The main thing here is a straightforward change to how expert traces are fed into RLVR: mask out the reward-critical bits (answers, code snippets, entities) while leaving the expert's overall plan and decomposition intact. That turns the problem into guided fill-in-the-blank rather than full generation or full copying. It is a finer-grained alternative to the usual partial-trajectory tricks, and the authors ship code, which is useful.

What works is the simplicity. No new loss terms or reward engineering, just a masking step before the policy sees the trace. The abstract claims up to 3.2 accuracy points over GRPO and 4.2x faster training across math, code, and agent tasks. If those numbers hold with decent controls, it is a practical knob people can turn.

The soft spot is exactly the one the stress-test flags. Keeping the expert structure means the policy still sees the decomposition and flow; nothing in the description shows an ablation where the same structure is paired with wrong but plausible masked content to test whether the model is actually reasoning or just completing templates. The accuracy lift is modest, and without seeing the full experimental details on baselines, variance, and whether the masking heuristic itself introduces selection effects, it is hard to know how much is real capability versus guided reconstruction.

This is for groups already running RLVR on reasoning models who want a low-overhead way to use expert data without obvious hacking. It is not a foundational shift, but the idea is clean enough and the code is out, so it deserves a serious referee who can check the experiments and the ablations.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Semantic Masked Expert Policy Optimization (SMEPO) for RLVR. SMEPO applies fine-grained semantic masking to reward-relevant spans (final answers, code, entities) in expert traces while preserving the expert's decomposition, plan, and procedural structure. This converts hard problems into a fill-in-the-blank task so the policy must reconstruct the masked content rather than copy it. The paper claims SMEPO yields accuracy gains of up to 3.2 points over GRPO and training-time reductions of up to 4.2x across math, code, and agentic-search domains, requires no reward or objective changes, and releases code at https://github.com/mit-han-lab/SMEPO.

Significance. If the reported gains are shown to arise from improved reasoning acquisition rather than pattern completion on preserved structure, SMEPO would supply a lightweight, reward-agnostic way to incorporate expert guidance in RLVR. The public code release is a clear strength for reproducibility.

major comments (2)

[Abstract] The empirical claims (accuracy +3.2, time -4.2x) are stated without any description of experimental setup, statistical significance testing, baseline implementations, or controls for confounding factors such as masking heuristics and data selection. This absence directly undermines evaluation of the central claim.
[Abstract, method and results] No ablation or control is described that pairs the retained expert decomposition with incorrect but plausible masked content. Without such a test it remains unclear whether the observed gains reflect genuine reasoning or reconstruction via domain priors and common solution templates.
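The control requested in the second comment could be built in several ways. One hedged sketch: keep the expert's prose intact but replace each integer with a nearby wrong value instead of a mask, so the trace stays plausible while its content is incorrect. The function name `corrupt_trace`, the integer-only regex, and the offset range of 1 to 9 are all assumptions for illustration; nothing on this page indicates the authors use this construction.

```python
import random
import re

# Standalone integers only; decimals and digits inside identifiers are skipped.
INTEGER = re.compile(r"(?<![\w.])\d+(?!\.\d|\w)")


def corrupt_trace(trace: str, seed: int = 0) -> str:
    """Shift every standalone integer by a random offset in [1, 9]."""
    rng = random.Random(seed)  # seeded so the control condition is reproducible
    return INTEGER.sub(lambda m: str(int(m.group()) + rng.randint(1, 9)), trace)


trace = "Add the legs: 3 + 4 = 7. Double it: 7 * 2 = 14."
print(corrupt_trace(trace))
```

If a policy trained on such corrupted traces matched one trained on masked traces, that would suggest the gains come from the preserved structure rather than from anything the policy reconstructs itself.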

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and describe the corresponding revisions.


Referee: [Abstract] The empirical claims (accuracy +3.2, time -4.2x) are stated without any description of experimental setup, statistical significance testing, baseline implementations, or controls for confounding factors such as masking heuristics and data selection. This absence directly undermines evaluation of the central claim.

Authors: We agree that the abstract is concise and would benefit from additional context. In the revised version we will expand the abstract to briefly note the evaluation domains (math, code, agentic search), the main baseline (GRPO), and that results are averaged over multiple random seeds with reported statistical significance. Full descriptions of masking heuristics, data selection, and baseline implementations appear in Sections 3 and 4. Abstract length limits preclude exhaustive detail, but the added sentence will improve clarity without altering the central claim.

revision: partial

Referee: [Abstract, method and results] No ablation or control is described that pairs the retained expert decomposition with incorrect but plausible masked content. Without such a test it remains unclear whether the observed gains reflect genuine reasoning or reconstruction via domain priors and common solution templates.

Authors: We acknowledge the value of this specific control. Our existing ablations (Section 5) compare semantic masking against random masking and unmasked traces, showing that structure-preserving semantic masking is necessary for the observed gains. However, we did not include an explicit condition that replaces masked spans with incorrect but plausible content. We will add this experiment in the revision to directly test whether gains derive from reasoning acquisition rather than domain priors or template reconstruction.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with no derivation chain


The paper proposes SMEPO as a practical semantic masking strategy for expert-guided RLVR, turning hard problems into fill-in-the-blank via selective masking of reward-relevant spans while preserving structure. Claims of up to 3.2-point accuracy gains and 4.2x speedup over GRPO are presented purely as experimental outcomes across math, code, and agentic search domains, with no equations, derivations, or first-principles results shown. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or description. The approach requires no reward function changes and is supported by code release, making the central contribution self-contained empirical validation rather than a closed theoretical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is presented as an empirical masking technique.


Reference graph

Works this paper leans on

83 extracted references · 41 canonical work pages · 25 internal anchors

[1]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[3]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization, 2025. URL https://arxiv.org/abs/2501.03262

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example, 2025. URL https://arxiv.org/abs/2504.20571

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-Like training: A critical perspective, 2025. URL https://arxiv.org/abs/ 2503.20783

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments, 2025. URL https://arxiv.org/abs/2504.03160

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Learn hard problems during RL with reference guided fine-tuning, 2026

Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, and Tianle Cai. Learn hard problems during RL with reference guided fine-tuning, 2026. URL https: //arxiv.org/abs/2603.01223

work page arXiv 2026
[9]

GHPO: Adaptive guidance for stable and efficient LLM reinforcement learning,

Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, and Dandan Tu. GHPO: Adaptive guidance for stable and efficient LLM reinforcement learning,
[10]

URLhttps://arxiv.org/abs/2507.10628

work page arXiv
[11]

Trust-region adaptive policy optimization, 2025

Mingyu Su, Jian Guan, Yuxian Gu, Minlie Huang, and Hongning Wang. Trust-region adaptive policy optimization, 2025. URLhttps://arxiv.org/abs/2512.17636. 10

work page arXiv 2025
[12]

POPE: Learning to reason on hard problems via privileged on-policy exploration, 2026

Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. POPE: Learning to reason on hard problems via privileged on-policy exploration, 2026. URL https://arxiv.org/abs/ 2601.18779

work page arXiv 2026
[13]

Learning to Reason under Off-Policy Guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance, 2025. URLhttps://arxiv.org/abs/2504.14945

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Adaptive guidance accelerates reinforcement learning of reasoning models, 2025

Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, and Sean Hendryx. Adaptive guidance accelerates reinforcement learning of reasoning models, 2025. URL https://arxiv.org/ abs/2506.13923

work page arXiv 2025
[15]

ReFT: Reasoning with reinforced fine-tuning, 2024

Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. ReFT: Reasoning with reinforced fine-tuning, 2024. URLhttps://arxiv.org/abs/2401.08967

work page arXiv 2024
[16]

explo- sion/spaCy: v3.7.2: Fixes for APIs and requirements, October 2023

Ines Montani, Matthew Honnibal, Adriane Boyd, Sofie Van Landeghem, and Henning Peters. explo- sion/spaCy: v3.7.2: Fixes for APIs and requirements, October 2023. URL https://doi.org/10.5281/ zenodo.10009823

2023
[17]

SaySelf: Teaching LLMs to express confidence with self-reflective rationales

Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois. GLiNER: Generalist model for named entity recognition using bidirectional transformer. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V o...

work page doi:10.18653/v1/2024 2024
[18]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Qwen2.5 Technical Report

Qwen Team, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao L...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URLhttps://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023. URL https://arxiv.org/ abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

American invitational mathematics examination (AIME) 2025, 2025

Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2025, 2025. URLhttps://huggingface.co/datasets/math-ai/aime25

2025
[23]

American invitational mathematics examination (AIME) 2026, 2026

Yifan Zhang and Team Math-AI. American invitational mathematics examination (AIME) 2026, 2026. URLhttps://huggingface.co/datasets/math-ai/aime26

2026
[24]

American mathematics competitions (AMC) 2023, 2023

Yifan Zhang and Team Math-AI. American mathematics competitions (AMC) 2023, 2023. URL https: //huggingface.co/datasets/math-ai/amc23

2023
[25]

OlympiadBench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems,

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems,
[26]

URLhttps://arxiv.org/abs/2402.14008

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[28]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation, 2023. URL https: //arxiv.org/abs/2305.01210

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024. URLhttps://arxiv.org/abs/2403.07974

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Smith, and Mike Lewis

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models, 2023. URL https://arxiv.org/abs/2210. 03350

2023
[31]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021
[32]

Aligning ai with shared human values.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Stein- hardt. Aligning ai with shared human values.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021
[33]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Gen Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pa...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page arXiv 2024
[36]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503.09516

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Learning Like Humans: Advancing LLM reasoning capabilities via adaptive difficulty curriculum learning and expert-guided self-reformulation, 2025

Enci Zhang, Xingang Yan, Wei Lin, Tianxiang Zhang, and Qianchun Lu. Learning Like Humans: Advancing LLM reasoning capabilities via adaptive difficulty curriculum learning and expert-guided self-reformulation, 2025. URLhttps://arxiv.org/abs/2505.08364

work page arXiv 2025
[38]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety, 2016. URLhttps://arxiv.org/abs/1606.06565

work page internal anchor Pith review Pith/arXiv arXiv 2016
[39]

Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021

Tom Everitt, Marcus Hutter, Ramana Kumar, and Victoria Krakovna. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective, 2021. URL https: //arxiv.org/abs/1908.04734. 12

work page arXiv 2021
[40]

Defining and characteriz- ing reward gaming

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characteriz- ing reward gaming. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, edi- tors,Advances in Neural Information Processing Systems, volume 35, pages 9460–9471. Curran As- sociates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/pa...

2022
[41]

BERT: pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, V ol...

work page doi:10.18653/v1/n19-1423 2019
[42]

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation, 2019. URLhttps://arxiv.org/abs/1905.02450

work page internal anchor Pith review Pith/arXiv arXiv 2019
[43]

BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeti...

2020
[44]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/ 20-074.html

2020
[45]

Efficient Training of Language Models to Fill in the Middle

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle, 2022. URLhttps://arxiv. org/abs/2207.14255

work page internal anchor Pith review Pith/arXiv arXiv 2022
[46]

InCoder: A generative model for code infilling and synthesis,

Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen tau Yih, Luke Zettlemoyer, and Mike Lewis. InCoder: A generative model for code infilling and synthesis,
[47]

URLhttps://arxiv.org/abs/2204.05999

work page internal anchor Pith review Pith/arXiv arXiv
[48]

SantaCoder: don’t reach for the stars!

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo Garcí... URL https://arxiv.org/abs/2301.03988
[50]

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, and Steven C. H. Hoi. CodeT5+: Open code large language models for code understanding and generation, 2023. URL https://arxiv.org/abs/2305.07922
[51]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence, 2024. URL https://arxiv.org/abs/2401.14196
[52]

DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Ga...
[53]

Evaluation of LLMs on syntax-aware code fill-in-the-middle tasks, 2024

Linyuan Gong, Sida Wang, Mostafa Elhoushi, and Alvin Cheung. Evaluation of LLMs on syntax-aware code fill-in-the-middle tasks, 2024. URL https://arxiv.org/abs/2403.04814
[54]

Structure-aware fill-in-the-middle pretraining for code, 2025

Linyuan Gong, Alvin Cheung, Mostafa Elhoushi, and Sida Wang. Structure-aware fill-in-the-middle pretraining for code, 2025. URL https://arxiv.org/abs/2506.00204.
[55]

Laminar: A scalable asynchronous RL post-training framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, pages 1279–1297. ACM, March 2025. doi: 10.1145/3689031.3696075. URL http://dx.doi.org/10.1145/3689031.3696...
[56]

So, we need to find k such that 4k + 3 = 2001. Solving for k: 4k + 3 = 2001, so 4k = 1998 and k = 499.5. Since k must be an integer, we need to adjust our approach.

The rollout repeats this failed pattern, repeatedly forcing even targets such as 2000 into a rule that applies to inputs of the form 4n + 1.

SMEPO Output

The masked trace removes the concrete numbers but...
[57]

**Odd Numbers**:
- For \( n = 4k + 1 \), \( f(n) = 4k + 3 \).
- For \( n = 4k + 3 \), \( f(n) = 8k + 2 \).
[58]

**Even Numbers**:
- If \( n \) is even, express \( n \) as \( 2^k \cdot m \) where \( m \) is odd. Then \( f(n) = 2^k \cdot f(m) \).

**Application to \( n = 1000 \)**:
- Decompose \( 1000 = 2^3 \cdot 125 \).
- \( 125 \) is odd and of the form \( 4k + 1 \) (\( 125 = 4 \cdot 31 + 1 \)), so \( f(125) = 4 \cdot 31 + 3 = 127 \).
- Therefore, \( f(1000) = 2^3 \...
[59]

\( f(4n + 1) = 4n + 3 \)

Let’s proceed step by step.

### Step 1: Understand the function properties
- The first property \( f(f(n)) = 2n \) tells us that applying the function twice doubles the input.
- The second property \( f(4n + 1) = 4n + 3 \) gives us the value of the function for numbers of the form \( 4n + 1 \).

### Step 2: Find a pattern or a w...
[60]

**Identify the form of \( n \):**
- \( 1000 \) is an even number, and we can express it as \( 1000 = 2^3 \cdot 125 \). Here, \( 125 \) is odd and can be written as \( 125 = 4 \cdot 31 + 1 \)
[61]

**Use the given conditions:**
- Since \( 125 = 4 \cdot 31 + 1 \), we have \( f(125) = 4 \cdot 31 + 3 = 127 \)
[62]

**Determine \( f(1000) \):**
- First, derive the scaling property. From \( f(f(n)) = 2n \), applying the same identity to \( f(n) \) gives \( f(f(f(n))) = 2f(n) \). Since \( f(f(n)) = 2n \), the left-hand side is also \( f(2n) \). Hence \( f(2n) = 2f(n) \), and by induction \( f(2^k \cdot m) = 2^k \cdot f(m) \).
- Since \( 1000 = 2^3 \cdot 125 \), we get ...
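
The rules quoted in this trace can be checked mechanically. The following is a minimal Python sketch, not code from the paper: the function name `f` mirrors the trace, while the helper layout is an assumption made for illustration. It applies the odd-number rules and the scaling property \( f(2^k \cdot m) = 2^k \cdot f(m) \), then confirms that the resulting function satisfies \( f(f(n)) = 2n \) on a range of inputs.

```python
def f(n: int) -> int:
    """Evaluate f from the rules quoted in the trace (illustrative sketch)."""
    if n <= 0:
        raise ValueError("f is only defined here for positive integers")
    # Even rule: strip factors of two, since f(2^k * m) = 2^k * f(m).
    scale = 1
    while n % 2 == 0:
        n //= 2
        scale *= 2
    # Odd rules: f(4k + 1) = 4k + 3 and f(4k + 3) = 8k + 2.
    k, remainder = divmod(n, 4)
    odd_value = 4 * k + 3 if remainder == 1 else 8 * k + 2
    return scale * odd_value


# The defining identity holds for every input tested.
assert all(f(f(n)) == 2 * n for n in range(1, 2001))
print(f(125), f(1000))
```

Since \( 1000 = 2^3 \cdot 125 \) and \( f(125) = 127 \), the last line prints 127 and 1016.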
[67]

**Find Minimum Cost**: Iterate through all (dx, dy) pairs to find the minimum number of chains, which corresponds to the minimum cost.

### Solution Code

```python
n = int(input())
points = [tuple(map(int, input().split())) for _ in range(n)]
if n == 1:
    print(1)
    exit()
points_set = set(points)
diff_set = set()
for i in range(n):
    x_i, y_i = points[i]
    fo...
```
[70]

**Generate Differences**: For each pair of distinct points, the differences (dx, dy) are computed and stored in a set to avoid duplicates
[72]

**Find Minimum**: The minimum number of chains across all (dx, dy) pairs is determined and printed as the result.

Masked Expert Hint

To solve this problem, we need to determine the minimum total cost required to collect all balls in a 2D plane by optimally choosing two integers p and q (not both zero). The cost is minimized by arranging the balls into the...
[73]

**Input Handling**: Read the number of balls and their coordinates
[74]

**Special Case for Single Ball**: If there is only one ball, the cost is always 1
[75]

**Generate Differences**: Compute all possible (dx, dy) pairs from the differences between every pair of distinct points
[76]

**Determine Chains**: For each (dx, dy) pair, calculate the number of chains required. A chain starts at a point if the previous point (by subtracting dx and dy) is not present
[77]

**Find Minimum Cost**: Iterate through all (dx, dy) pairs to find the minimum number of chains, which corresponds to the minimum cost.

### Solution Code

```python
[CODE]
```

### Explanation
[78]

**Input Handling**: The code reads the number of balls and their coordinates, storing them in a list and a set for quick lookups
[79]

**Special Case Handling**: If there’s only one ball, the cost is directly output as 1
[80]

**Generate Differences**: For each pair of distinct points, the differences (dx, dy) are computed and stored in a set to avoid duplicates.
[81]

**Chain Calculation**: For each (dx, dy) pair, the code checks each point to see if it can start a new chain by verifying if the previous point (x-dx, y-dy) exists in the set of points. The count of such starting points gives the number of chains
[82]

**Find Minimum**: The minimum number of chains across all (dx, dy) pairs is determined and printed as the result.

Expert Output

```python
n = int(input())
points = [tuple(map(int, input().split())) for _ in range(n)]
if n == 1:
    print(1)
    exit()
points_set = set(points)
diff_set = set()
for i in range(n):
    x_i, y_i = points[i]
    for j in range(n):
        if i != j:
            ...
```
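
The expert code is truncated in this excerpt, so the following is a minimal sketch of the procedure that the explanation steps describe, not a reconstruction of the missing lines. The function name `min_collection_cost` and its list-of-tuples interface are assumptions introduced here, and the sketch assumes that all ball coordinates are distinct.

```python
def min_collection_cost(points: list[tuple[int, int]]) -> int:
    """Minimum number of chains over all candidate (dx, dy) steps (sketch)."""
    n = len(points)
    if n == 1:
        # Special case from the explanation: a single ball always costs 1.
        return 1
    point_set = set(points)
    # Generate differences: every (dx, dy) between two distinct points.
    diffs = {
        (xj - xi, yj - yi)
        for i, (xi, yi) in enumerate(points)
        for j, (xj, yj) in enumerate(points)
        if i != j
    }
    best = n
    for dx, dy in diffs:
        # A chain starts at a point whose predecessor (x - dx, y - dy) is absent.
        chains = sum((x - dx, y - dy) not in point_set for x, y in points)
        best = min(best, chains)
    return best


print(min_collection_cost([(1, 1), (1, 2), (2, 1), (2, 2)]))
```

For the four corners of a unit square, the steps (0, 1) and (1, 0) each leave two chain starts, so the call prints 2. Enumerating every pairwise difference is quadratic in the number of balls, and each candidate step costs a linear scan with constant-time set lookups.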

[83]

Identify the Amtrak train derailment incident in 2015 involving a train from Washington, D.C. to New York City
[84]

Find the list of eight victims killed in that incident
[85]

Determine which of those victims was an American social entrepreneur
[86]

Find the age of that individual at the time of the incident

First, I’ll search for information about the 2015 Amtrak derailment to identify the victims.

<search> Find information about the 2015 Amtrak train derailment from Washington, D.C. to New York City, including the list of victims killed in the incident. </search>

Round 2: From the list of victims, ...
[87]

Identify the [ENTITY] train derailment incident in [ENTITY] involving a train from [ENTITY], [ENTITY] to [ENTITY]
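
The masked plan step above can be reproduced from its unmasked counterpart once the entity mentions are known. The following is a minimal sketch under stated assumptions: the entity surface forms are supplied by hand instead of by a named-entity recognizer, and the helper names `find_spans` and `mask_spans` are hypothetical, not taken from the paper.

```python
def find_spans(text: str, surface_forms: list[str]) -> list[tuple[int, int]]:
    """Locate each surface form left to right; returns (start, end) offsets."""
    spans, cursor = [], 0
    for form in surface_forms:
        start = text.find(form, cursor)
        if start == -1:
            continue  # Form not present after the cursor: skip it.
        spans.append((start, start + len(form)))
        cursor = start + len(form)
    return spans


def mask_spans(text: str, spans: list[tuple[int, int]],
               placeholder: str = "[ENTITY]") -> str:
    """Replace each character span with the placeholder, keeping the rest."""
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # Ignore spans that overlap one already masked.
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


step = ("Identify the Amtrak train derailment incident in 2015 involving "
        "a train from Washington, D.C. to New York City")
# Hand-listed entity mentions (an assumption standing in for an NER model).
entities = ["Amtrak", "2015", "Washington", "D.C.", "New York City"]
masked = mask_spans(step, find_spans(step, entities))
print(masked)
```

With these five mentions the output matches the masked step quoted above: the connective words and punctuation survive, so the route of the plan stays readable while the reward-relevant entities are hidden.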
