Deep Research as Rubric for Reinforcement Learning

Bo Chen; Deqing Yang; Jiaqing Liang; Lefan Zhang; Wangyi Mei; Yan Gao; Yao Hu; Yin Cai; Yi Wu; Zhenhan Bai

arxiv: 2606.01091 · v1 · pith:IN6YQY2Mnew · submitted 2026-05-31 · 💻 cs.CL

Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang

Pith reviewed 2026-06-28 17:14 UTC · model grok-4.3

classification 💻 cs.CL

keywords: rubric construction · reinforcement learning · agentic search · reward signals · open-ended tasks · GRPO · bootstrap learning · policy optimization

The pith

Reframing rubric construction as an evidence-driven research process via iterative multi-turn agentic search yields scalable fine-grained reward signals for open-ended tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing rubrics for open-ended reasoning and long-form generation are treated as static artifacts and often miss task-specific knowledge-intensive dimensions. The paper reframes rubric construction itself as a research problem solved through a two-stage process: first using iterative multi-turn agentic search to gather domain facts, structural constraints, and failure modes, then distilling the evidence into atomic independently verifiable constraints. These constraints provide the reward signal for GRPO-based policy optimization, and the approach supports bootstrap use where the training model generates its own rubrics without frontier assistance. Experiments on six benchmarks show competitive results with only 1K-3K training instances, with bootstrap rubrics reaching best overall performance after three iterations. A sympathetic reader would care because reliable automatic verification has been a bottleneck for scaling reinforcement learning on complex open-ended tasks.

Core claim

DR-rubric is a two-stage framework in which Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search, and Stage II distills this evidence into atomic, independently verifiable constraints. These constraints serve as reward signals for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, the method supports bootstrap rubric generation without frontier-model assistance. On six benchmarks spanning agentic research and expert reasoning, the approach achieves strong competitive performance with only 1K-3K training instances; GPT-5-generated rubrics benefit breadth coverage on agentic tasks.
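To make the reward path concrete, here is a minimal sketch of how atomic constraints could become a scalar reward and then a group-relative advantage. It is an illustration under stated assumptions, not the paper's implementation: the `AtomicConstraint` structure, the "fraction of constraints satisfied" aggregation, the string-matching verifiers, and the use of a population standard deviation are all hypothetical choices. Only the general idea of standardizing rewards within a sampled group comes from GRPO.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass(frozen=True)
class AtomicConstraint:
    """One independently verifiable rubric item (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]  # stand-in for an LLM or rule-based verifier


def rubric_reward(response: str, rubric: List[AtomicConstraint]) -> float:
    """Assumed aggregation: fraction of constraints the response satisfies."""
    if not rubric:
        return 0.0
    return sum(c.check(response) for c in rubric) / len(rubric)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rubric with two string-matching constraints (illustrative only).
rubric = [
    AtomicConstraint("mentions GRPO", lambda t: "GRPO" in t),
    AtomicConstraint("states a training-set size", lambda t: "1K" in t or "3K" in t),
]

# One group of sampled responses for the same prompt.
group = ["GRPO with 1K instances", "GRPO only", "neither"]
rewards = [rubric_reward(g, rubric) for g in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the response that satisfies both constraints gets a positive advantage, the one that satisfies neither gets a negative one, and the middle response sits at the group mean.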

What carries the argument

The DR-rubric two-stage framework, where Stage I performs iterative multi-turn agentic search to synthesize evidence and Stage II distills it into atomic verifiable constraints used as reward signals.

If this is right

The training model can generate its own rubrics, enabling bootstrap refinement without external frontier models.
GPT-5-generated rubrics improve breadth coverage specifically on agentic tasks.
Gemini-generated rubrics deliver the most balanced results across both agentic and expert-reasoning benchmarks.
Bootstrap rubrics evolve through specialization then rebalancing, reaching peak performance at the third iteration.
Competitive policy optimization is possible on the tested benchmarks with training sets as small as 1K-3K instances.
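The bootstrap claim in the list above can be pictured as a short loop in which the current policy writes the rubrics that train its successor. The sketch below is hypothetical throughout: `generate_rubrics` and `train_with_grpo` are placeholder callables, policies are represented as plain strings, and the choice to continue from the previous checkpoint at every step (rather than restarting from the base model) is an assumption, not something the summary confirms.

```python
from typing import Callable, List, Tuple

Rubrics = List[str]


def bootstrap(
    policy: str,
    queries: List[str],
    generate_rubrics: Callable[[str, List[str]], Rubrics],
    train_with_grpo: Callable[[str, Rubrics], str],
    iterations: int = 3,
) -> List[Tuple[str, Rubrics]]:
    """Run `iterations` rounds of self-generated rubrics followed by training."""
    history: List[Tuple[str, Rubrics]] = []
    for _ in range(iterations):
        # The current policy acts as its own rubric generator.
        rubrics = generate_rubrics(policy, queries)
        # The rubrics supply the reward signal for the next training round.
        policy = train_with_grpo(policy, rubrics)
        history.append((policy, rubrics))
    return history


# Stand-in implementations so the loop can run end to end.
def fake_generate(policy: str, queries: List[str]) -> Rubrics:
    return [f"rubric by {policy} for {q}" for q in queries]


def fake_train(policy: str, rubrics: Rubrics) -> str:
    return policy + "'"


history = bootstrap("base", ["q1", "q2"], fake_generate, fake_train)
```

Each entry of `history` pairs a trained checkpoint with the rubrics that produced it, which is the information needed to compare iterations afterwards.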

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evidence-driven rubric process could be applied to create reward signals for other open-ended domains such as creative writing or scientific hypothesis generation.
Self-generated rubrics open the possibility of closed-loop self-improvement in which each training cycle produces better constraints for the next.
Mixing rubric sources (for example, agentic-search rubrics for breadth with expert-reasoning rubrics for depth) might produce task-specific hybrids superior to any single source.
The discovered failure modes could be reused as diagnostic tools to evaluate models even outside the reinforcement-learning setting.

Load-bearing premise

Iterative multi-turn agentic search can reliably discover and synthesize the task-specific, knowledge-intensive dimensions and failure modes that matter most for the target task.

What would settle it

A controlled experiment in which policies trained with DR-rubric constraints show no improvement over policies trained with static hand-crafted or simple prompt-generated rubrics on the same six benchmarks using identical 1K-3K instance counts and GRPO settings.

Figures

Figures reproduced from arXiv: 2606.01091 by Bo Chen, Deqing Yang, Jiaqing Liang, Lefan Zhang, Wangyi Mei, Yan Gao, Yao Hu, Yin Cai, Yi Wu, Zhenhan Bai, Zhenxin Ding, Zhouhong Gu.

**Figure 2.** Training efficiency: per-benchmark scores vs. training instances (log scale). DR-Rubric-8B achieves the strongest results with significantly fewer training instances than all baselines.
Adjacent text: Reasoning transfer. DR-Rubric-8B (BS-3) leads on MMLU-Pro (78.0, +3.7) and MMLU (85.3, +1.5), while DR-Rubric-8B (Gemini) achieves the highest GPQA score (57.3, +1.3 over WebExplorer-8B). Rubric-based RL thus transfers beyo…

**Figure 3.** Bootstrap trajectory. (a) Performance is non-monotonic. (b) Verifiability and formula usage surge at BS-2 then stabilize, while constraint verbosity peaks then retracts.
Adjacent text: Bootstrap rubrics: specialization-to-rebalancing. Bootstrap rubrics exhibit a non-monotonic trajectory (Figure 3): BS-1 achieves strong reasoning scores (GPQA 57.0, MMLU 84.0); BS-2 sees reasoning scores dip as rubric structural shifts …

**Figure 4.** Polarization predicts over-bootstrap be…

**Figure 5.** Dimension count distribution across bootstrap steps. Step 1 is bimodal (peaks at 4 and 7–8…

**Figure 6.** Training dynamics over 70 GRPO steps for GPT-5 and bootstrap-rubric variants (BS-1–4).

**Figure 7.** BS-5 training collapse anatomy (30B-A3B). Policy loss spikes at step 20, gradient norm…

Original abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts (either hand-crafted or prompt-generated) and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-Rubric achieves strong competitive performance with only 1K–3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DR-rubric treats rubric construction as an agentic research process with a two-stage search-plus-distillation pipeline and self-bootstrapping, but the abstract supplies no metrics or details to check the performance claims.

The main point is a two-stage method that first runs iterative multi-turn agentic search to surface domain facts, constraints, and failure modes, then distills them into atomic verifiable items for GRPO training. A bootstrap variant lets the model under training generate its own rubrics, with reported evolution toward better balance by the third iteration. Different frontier models are tested for the search stage and produce rubrics with different coverage strengths.

This directly tackles the issue that static or lightly prompted rubrics often skip task-specific knowledge on open-ended work. Framing rubric building as evidence collection rather than template filling is a straightforward and useful shift, and the self-bootstrapping angle reduces reliance on larger models. The small data regime (1K-3K examples) and coverage of both agentic and expert-reasoning benchmarks also line up with practical needs in RL for reasoning.

The abstract states competitive results across six benchmarks but gives no numbers, baselines, error bars, or examples of the generated rubrics or search traces. That makes it impossible to judge whether the agentic stage actually finds the right dimensions or whether the gains are real versus artifacts of the evaluation. The core assumption that multi-turn search reliably surfaces the most relevant failure modes still needs concrete validation from the full paper.

The work is aimed at researchers doing reward modeling and policy optimization for complex generation and reasoning tasks. Anyone already experimenting with rubric-based or process-supervision signals would get value from the pipeline description and the bootstrap observations. It is coherent enough on its own terms to deserve peer review so the experimental sections can be examined.

Referee Report

2 major / 1 minor

Summary. The paper proposes Deep Research as Rubric (DR-rubric), a two-stage framework that reframes rubric construction for RL policy optimization as an evidence-driven research process. Stage I performs iterative multi-turn agentic search to discover domain facts, structural constraints, and failure modes; Stage II distills the evidence into atomic, independently verifiable constraints used for GRPO-based optimization. The approach supports bootstrap rubric generation with an 8B model and is evaluated on 6 benchmarks spanning agentic research and expert reasoning, claiming competitive performance with 1K–3K training instances, generator-dependent strengths, and a non-monotonic bootstrap trajectory that reaches its best overall performance at the third iteration.

Significance. If the empirical claims hold, the work could advance reward design for open-ended tasks by replacing static or prompt-generated rubrics with synthesized, task-specific constraints derived from external knowledge, offering a scalable alternative that reduces reliance on frontier models while improving fine-grained signal quality.

major comments (2)

[Abstract] Abstract: the central claim of 'strong competitive performance' on 6 benchmarks with only 1K–3K instances is stated without any metrics, baselines, error bars, ablation results, or experimental protocol, rendering the primary empirical contribution impossible to assess or replicate from the supplied text.
[Abstract] Abstract (Stage I description): the assumption that iterative multi-turn agentic search 'reliably discovers and synthesizes the task-specific, knowledge-intensive dimensions and failure modes' is presented as an observed outcome without any validation, human evaluation, or failure-case analysis of the search process itself.

minor comments (1)

[Abstract] The abstract refers to 'GRPO-based policy optimization' and 'bootstrap rubric generation' without expanding the GRPO acronym, defining bootstrap rubric generation, or stating the precise optimization objective on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below, clarifying that the full manuscript supplies the requested experimental details while agreeing to strengthen the abstract and add limitations discussion where appropriate.

Referee: [Abstract] Abstract: the central claim of 'strong competitive performance' on 6 benchmarks with only 1K–3K instances is stated without any metrics, baselines, error bars, ablation results, or experimental protocol, rendering the primary empirical contribution impossible to assess or replicate from the supplied text.

Authors: The abstract is a concise summary constrained by length limits. The full manuscript provides all requested elements in Section 4 (Tables 1–3 report exact metrics, baselines including standard GRPO and prompt-generated rubrics, and comparisons across the 6 benchmarks) and Section 3 (full experimental protocol with 1K–3K instance counts and training details). Results include means and standard deviations from multiple runs. We will revise the abstract to include one or two representative quantitative results.
Revision: yes

Referee: [Abstract] Abstract (Stage I description): the assumption that iterative multi-turn agentic search 'reliably discovers and synthesizes the task-specific, knowledge-intensive dimensions and failure modes' is presented as an observed outcome without any validation, human evaluation, or failure-case analysis of the search process itself.

Authors: The abstract summarizes the observed outcome. The manuscript provides supporting evidence via progressive performance gains across bootstrap iterations (Section 4.3) and qualitative examples of generated atomic constraints in the appendix, which illustrate coverage of domain facts and failure modes. We agree that explicit human evaluation of the search process itself is absent and will add a limitations paragraph addressing this and potential failure cases.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents DR-rubric as an empirical two-stage engineering framework (agentic search followed by distillation into verifiable constraints) evaluated on benchmarks. No equations, fitted parameters, predictions of derived quantities, or load-bearing self-citations appear in the abstract or described claims. The bootstrap evolution is reported as an observed experimental outcome across iterations rather than a definitional identity or fitted input renamed as prediction. The central claim—that reframing rubric construction as research yields better rewards—rests on external benchmark results and does not reduce to its own inputs by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that multi-turn agentic search can surface the precise knowledge needed for atomic constraints; no free parameters or invented entities are stated in the abstract.

axioms (1)

Domain assumption: Rubric construction is itself a research problem that benefits from iterative external knowledge synthesis via agentic search.
This premise underpins the entire two-stage design and is invoked in the abstract description of Stage I.

Appendix fragments (rubric prompts and bootstrap stopping rule)

Key Dimensions of Quality: (What specific attributes, e.g., conciseness, creativity, coding style, matter most for this specific query?)
4. Edge Cases & Constraints: (What subtle details must be correct?)
Action: Begin your deep research into {{ query }} now.

System Prompt: Stage II (Rubric Synthesis)

# Role Definition
You are an expert in evaluation framework design for academic research...

Analyze the Final Report: Use the provided {{ final_report }} as the ground truth. This report contains the necessary factual information (algorithms, parameters, benchmarks, etc.) that a high-quality response should contain.
Design the Framework: Construct evaluation criteria where the specific metrics, names, and thresholds are derived from the facts found in the {{ final_report }}.

Prohibited Content:
Direct answers to the query or summaries of the final report
Recommendations, tutorials, or explanatory content
Generic criteria unrelated to the specific facts in the final report.

# Evaluation Framework Design Requirements
1. Content-Centric & Fact-Based Evaluation:
• Core Principle: Focus on information quality and factual completeness based on the {{ final_report }}.
• Calibration: Do not set unrealistic thresholds. Align thresholds with the actual information landscap...
(fragment) >=[X] specific algorithms included

Framework Generation: Assess response quality based on the facts provided in {{ final_report }}.
2. Dimension Design: 3–8 dimensions, Core Dimension first.
3. Core Dimension:
• Explicitly labeled (e.g., “#### Core Dimension: . . . ”).
• Grounded in Fact: Check for the specific entities (methods, papers, data) present in the {{ final_report }}.
4. Strict Thres...

Bootstrap stopping rule

Primary criterion: After bootstrap step n, compute polarization P_n (fraction of samples scoring 0 or ≥ 0.99). Stop if P_n > max(0.15, 2 × P_{n−1}).
2. Model selection: arg min_k P_k (lowest polarization).
Auxiliary: If min(entropy) < 0.70 within a step, the rubrics are too narrow; consider reducing constraint specificity.
Hard bound: Do not exceed 3 bootstrap iterations without external rubric grounding (e.g., a GPT-5 “reset” step).
Retrospective validation on both scales: this rule correctly selects BS-2 and terminates at BS-3, avoiding BS-4/BS-5 and saving 40–60% of total bootstrap compute.

J Fresh-Start Bootstrap: Disentangling Rubric Quality from Policy Drift
The cumula...
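The bootstrap stopping rule quoted above (polarization as the fraction of samples scoring 0 or at least 0.99, stop when P_n > max(0.15, 2 × P_{n−1}), select the step with the lowest polarization, and never run past 3 iterations without external grounding) can be sketched as follows. The function names are hypothetical, the first step is assumed to have no predecessor to compare against, and the polarization values in the usage lines are invented for illustration rather than taken from the paper. The auxiliary entropy check is left out because the fragment does not say which entropy is measured.

```python
from typing import List, Sequence, Tuple


def polarization(scores: Sequence[float]) -> float:
    """Fraction of samples whose rubric score is 0 or at least 0.99."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s <= 0.0 or s >= 0.99) / len(scores)


def should_stop(p_n: float, p_prev: float) -> bool:
    """Primary criterion: stop if P_n > max(0.15, 2 * P_{n-1})."""
    return p_n > max(0.15, 2 * p_prev)


def apply_rule(pols: List[float], hard_bound: int = 3) -> Tuple[int, int]:
    """Return (last step run, selected step), both 1-indexed.

    pols[k] is the polarization measured after bootstrap step k + 1.
    Assumption: step 1 has no predecessor, so the primary criterion
    only applies from step 2 onward.
    """
    stop_at = len(pols)
    for n in range(1, len(pols) + 1):
        if n >= 2 and should_stop(pols[n - 1], pols[n - 2]):
            stop_at = n
            break
        if n >= hard_bound:  # hard bound on ungrounded bootstrap iterations
            stop_at = n
            break
    # Model selection: lowest polarization among the steps actually run.
    best = min(range(stop_at), key=lambda k: pols[k]) + 1
    return stop_at, best


# Invented polarization values, for illustration only.
example = apply_rule([0.10, 0.06, 0.20])
early = apply_rule([0.05, 0.30, 0.01])
share = polarization([0.0, 1.0, 0.5, 0.995])
```

With the invented values, the first call stops at step 3 and selects step 2, while the second call stops at step 2 because polarization more than doubles and exceeds the 0.15 floor.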

[1] [1]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.CoRR, abs/2507.17746, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Chasing the tail: Effective rubric- based reward modeling for large language model post-training.CoRR, abs/2509.21500, 2025

Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, and Lifeng Jin. Chasing the tail: Effective rubric- based reward modeling for large language model post-training.CoRR, abs/2509.21500, 2025

work page arXiv 2025

[3] [3]

Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general LLM reasoning

Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, Wei Chen, Yan Xie, and Mingli Song. Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general LLM reasoning. CoRR, abs/2508.16949, 2025

work page arXiv 2025

[4] [4]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health.CoRR, abs/2505.08775, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Openrubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment.CoRR, abs/2510.07743, 2025

Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment.CoRR, abs/2510.07743, 2025

work page arXiv 2025

[6] [6]

Auto-rubric: Learning to extract generalizable criteria for reward modeling.CoRR, abs/2510.17314, 2025

Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, and Bolin Ding. Auto-rubric: Learning to extract generalizable criteria for reward modeling.CoRR, abs/2510.17314, 2025

work page arXiv 2025

[7] Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, and Junbo Zhao. Reinforcement learning with rubric anchors. CoRR, abs/2508.12790, 2025.

[8] Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, and Jiajun Zhang. ACE-RL: adaptive constraint-enhanced reward for long-form generation reinforcement learning. CoRR, abs/2509.04903, 2025.

[9] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David A. Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. DR tulu: Reinforcement learning with...

[10] MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, and Afra Feyza Akyürek. Online rubrics elicitation from pairwise comparisons. CoRR, abs/2510.07284, 2025.

[11] Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, L...

[12] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. CoRR, abs/2401.01335, 2024.

[13] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. CoRR, abs/2401.10020, 2024.

[14] Li S. Yifei, Allen Chang, Chaitanya Malaviya, and Mark Yatskar. Researchqa: Evaluating scholarly question answering at scale across 75 fields with survey-mined questions and rubrics. CoRR, abs/2509.00496, 2025.

[15] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents. CoRR, abs/2506.11763, 2025.

[16] Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, and Ting Su. Localsearchbench: Benchmarking agentic search in real-world local life services. CoRR, abs/2512.07436, 2025.

[17] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023.

[18] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela ...

2024

[19] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.

[20] Qwen Team. Qwen3 technical report. CoRR, abs/2505.09388, 2025.

[21] Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Heng Ji, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. CoRR, abs/2503.09516, 2025.

[22] Qingfei Zhao, Ruobing Wang, Yukun Yan, Ruihua Song, Zhichao Duan, Renjun Hu, Xinyu Cao, Ying Chen, Li Ma, Shu Li, Yong Zhang, Mingde Shao, Zhiyuan Liu, and Maosong Sun. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. CoRR, abs/2411.02337, 2024.

[23] OpenAI. OpenAI GPT-5 system card. CoRR, abs/2601.03267, 2026.

[24] Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, February 2026. Accessed: 2026-05-30.

[25] DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025.

[26] Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, Zheng Liu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. CoRR, abs/2504.21776, 2025.

[27] Tongyi DeepResearch Technical Report

Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayi...

[28] Daniel Walnut, Henry Li, and ByteDance Inc. DeerFlow: Deep exploration and efficient research flow. https://github.com/bytedance/deer-flow, 2025. Open-source multi-agent framework for deep research automation.

[29] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.

[31] Mistral AI. Ministral 3. CoRR, abs/2601.08584, 2026.

[32] S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zha...

[33] 3. Key Dimensions of Quality: (What specific attributes, e.g., conciseness, creativity, coding style, matter most for this specific query?)
4. Edge Cases & Constraints: (What subtle details must be correct?)

Action: Begin your deep research into {{ query }} now.

System Prompt: Stage II (Rubric Synthesis)

# Role Definition

You are an expert in evaluation framework design for academic research...

[34] Analyze the Final Report: Use the provided {{ final_report }} as the ground truth. This report contains the necessary factual information (algorithms, parameters, benchmarks, etc.) that a high-quality response should contain.

[35] Design the Framework: Construct evaluation criteria where the specific metrics, names, and thresholds are derived from the facts found in the {{ final_report }}. Prohibited Content:

[36] Direct answers to the query or summaries of the final report

[37] Recommendations, tutorials, or explanatory content

[38]

>=[X] specific algorithms included

Generic criteria unrelated to the specific facts in the final report.

# Evaluation Framework Design Requirements

1. Content-Centric & Fact-Based Evaluation:
• Core Principle: Focus on information quality and factual completeness based on the {{ final_report }}.
• Calibration: Do not set unrealistic thresholds. Align thresholds with the actual information landscap...

[39] 1. Framework Generation: Assess response quality based on the facts provided in {{ final_report }}.
2. Dimension Design: 3–8 dimensions, Core Dimension first.
3. Core Dimension:
• Explicitly labeled (e.g., “#### Core Dimension: ...”).
• Grounded in Fact: Check for the specific entities (methods, papers, data) present in the {{ final_report }}.
4. Strict Thres...

[40] 1. Primary criterion: After bootstrap step $n$, compute the polarization $P_n$ (the fraction of samples scoring 0 or $\geq 0.99$). Stop if $P_n > \max(0.15,\ 2 \times P_{n-1})$.
2. Model selection: $\arg\min_k P_k$ (lowest polarization).
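The primary criterion and the model-selection rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 1-indexed step convention, the choice to start the stop check at step 2 (where $P_{n-1}$ first exists), and the restriction of model selection to steps up to the stop point are all hypothetical, as are the example polarization values.

```python
from typing import Optional, Sequence, Tuple


def polarization(scores: Sequence[float], hi: float = 0.99) -> float:
    """Fraction of samples whose rubric score is 0 or at least `hi`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    extreme = sum(1 for s in scores if s == 0 or s >= hi)
    return extreme / len(scores)


def should_stop(p_curr: float, p_prev: float, floor: float = 0.15) -> bool:
    """Primary criterion: stop if P_n > max(floor, 2 * P_{n-1})."""
    return p_curr > max(floor, 2.0 * p_prev)


def select_checkpoint(p_by_step: Sequence[float]) -> Tuple[int, Optional[int]]:
    """Return (best_step, stop_step), both 1-indexed.

    stop_step is the first step that triggers the primary criterion,
    or None if it never triggers. best_step is the arg min of P_k over
    the steps run so far (assumption: only steps up to the stop point
    are candidates).
    """
    if not p_by_step:
        raise ValueError("p_by_step must be non-empty")
    stop_step = None
    # Assumption: the check starts at step 2, where P_{n-1} is defined.
    for n in range(1, len(p_by_step)):
        if should_stop(p_by_step[n], p_by_step[n - 1]):
            stop_step = n + 1
            break
    evaluated = p_by_step if stop_step is None else p_by_step[:stop_step]
    best_step = min(range(len(evaluated)), key=lambda k: evaluated[k]) + 1
    return best_step, stop_step


# Hypothetical polarization values for three bootstrap steps.
print(select_checkpoint([0.08, 0.05, 0.20]))  # (2, 3)
```

With these made-up values, step 3 trips the criterion because 0.20 exceeds max(0.15, 2 × 0.05), and step 2 has the lowest polarization, so it is the one selected.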

[41] Auxiliary: If min(entropy) < 0.70 within a step, the rubrics are too narrow; consider reducing constraint specificity.

[42] Hard bound: Do not exceed 3 bootstrap iterations without external rubric grounding (e.g., a GPT-5 “reset” step).

Retrospective validation on both scales: this rule correctly selects BS-2 and terminates at BS-3, avoiding BS-4/BS-5 and saving 40–60% of total bootstrap compute.

J Fresh-Start Bootstrap: Disentangling Rubric Quality from Policy Drift

The cumula...