pith. sign in

arxiv: 2606.01091 · v1 · pith:IN6YQY2Mnew · submitted 2026-05-31 · 💻 cs.CL

Deep Research as Rubric for Reinforcement Learning

Pith reviewed 2026-06-28 17:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords rubric constructionreinforcement learningagentic searchreward signalsopen-ended tasksGRPObootstrap learningpolicy optimization
0
0 comments X

The pith

Reframing rubric construction as an evidence-driven research process via iterative multi-turn agentic search yields scalable fine-grained reward signals for open-ended tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing rubrics for open-ended reasoning and long-form generation are treated as static artifacts and often miss task-specific knowledge-intensive dimensions. The paper reframes rubric construction itself as a research problem solved through a two-stage process: first using iterative multi-turn agentic search to gather domain facts, structural constraints, and failure modes, then distilling the evidence into atomic independently verifiable constraints. These constraints provide the reward signal for GRPO-based policy optimization, and the approach supports bootstrap use where the training model generates its own rubrics without frontier assistance. Experiments on six benchmarks show competitive results with only 1K-3K training instances, with bootstrap rubrics reaching best overall performance after three iterations. A sympathetic reader would care because reliable automatic verification has been a bottleneck for scaling reinforcement learning on complex open-ended tasks.

Core claim

DR-rubric is a two-stage framework in which Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search, and Stage II distills this evidence into atomic, independently verifiable constraints. These constraints serve as reward signals for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, the method supports bootstrap rubric generation without frontier-model assistance. On six benchmarks spanning agentic research and expert reasoning, the approach achieves strong competitive performance with only 1K-3K training instances; GPT-5-generated rubrics benefit breadth coverage on agentic task

What carries the argument

The DR-rubric two-stage framework, where Stage I performs iterative multi-turn agentic search to synthesize evidence and Stage II distills it into atomic verifiable constraints used as reward signals.

If this is right

  • The training model can generate its own rubrics, enabling bootstrap refinement without external frontier models.
  • GPT-5-generated rubrics improve breadth coverage specifically on agentic tasks.
  • Gemini-generated rubrics deliver the most balanced results across both agentic and expert-reasoning benchmarks.
  • Bootstrap rubrics evolve through specialization then rebalancing, reaching peak performance at the third iteration.
  • Competitive policy optimization is possible on the tested benchmarks with training sets as small as 1K-3K instances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evidence-driven rubric process could be applied to create reward signals for other open-ended domains such as creative writing or scientific hypothesis generation.
  • Self-generated rubrics open the possibility of closed-loop self-improvement in which each training cycle produces better constraints for the next.
  • Mixing rubric sources (for example, agentic-search rubrics for breadth with expert-reasoning rubrics for depth) might produce task-specific hybrids superior to any single source.
  • The discovered failure modes could be reused as diagnostic tools to evaluate models even outside the reinforcement-learning setting.

Load-bearing premise

Iterative multi-turn agentic search can reliably discover and synthesize the task-specific, knowledge-intensive dimensions and failure modes that matter most for the target task.

What would settle it

A controlled experiment in which policies trained with DR-rubric constraints show no improvement over policies trained with static hand-crafted or simple prompt-generated rubrics on the same six benchmarks using identical 1K-3K instance counts and GRPO settings.

Figures

Figures reproduced from arXiv: 2606.01091 by Bo Chen, Deqing Yang, Jiaqing Liang, Lefan Zhang, Wangyi Mei, Yan Gao, Yao Hu, Yin Cai, Yi Wu, Zhenhan Bai, Zhenxin Ding, Zhouhong Gu.

Figure 1
Figure 1. Figure 1: Overview of DR-Rubric. Given a task prompt [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Training efficiency: per￾benchmark scores vs. training instances (log scale). DR-Rubric-8B achieves the strongest results with significantly fewer training instances than all baselines. Reasoning transfer. DR-Rubric-8B (BS-3) leads on MMLU-Pro (78.0, +3.7) and MMLU (85.3, +1.5), while DR-Rubric-8B (Gemini) achieves the highest GPQA score (57.3, +1.3 over WebExplorer-8B). Rubric-based RL thus transfers beyo… view at source ↗
Figure 3
Figure 3. Figure 3: Bootstrap trajectory. (a) Per￾formance is non-monotonic. (b) Verifia￾bility and formula usage surge at BS-2 then stabilize, while constraint verbosity peaks then retracts. Bootstrap rubrics: specialization-to-rebalancing. Bootstrap rubrics exhibit a non-monotonic trajectory (Fig￾ure 3): BS-1 achieves strong reasoning scores (GPQA 57.0, MMLU 84.0); BS-2 sees reasoning scores dip as rubric structural shifts … view at source ↗
Figure 4
Figure 4. Figure 4: Polarization predicts over-bootstrap be [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Dimension count distribution across bootstrap steps. Step 1 is bimodal (peaks at 4 and 7–8 [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training dynamics over 70 GRPO steps for GPT-5 and bootstrap-rubric variants (BS-1–4). [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: BS-5 training collapse anatomy (30B-A3B). Policy loss spikes at step 20, gradient norm [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
read the original abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts -- either hand-crafted or prompt-generated -- and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-Rubric achieves strong competitive performance with only 1K -- 3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Deep Research as Rubric (DR-rubric), a two-stage framework that reframes rubric construction for RL policy optimization as an evidence-driven research process. Stage I performs iterative multi-turn agentic search to discover domain facts, structural constraints, and failure modes; Stage II distills the evidence into atomic, independently verifiable constraints used for GRPO-based optimization. The approach supports bootstrap rubric generation with an 8B model and is evaluated on 6 benchmarks spanning agentic research and expert reasoning, claiming competitive performance with 1K–3K training instances, generator-dependent strengths, and progressive improvement across bootstrap iterations.

Significance. If the empirical claims hold, the work could advance reward design for open-ended tasks by replacing static or prompt-generated rubrics with synthesized, task-specific constraints derived from external knowledge, offering a scalable alternative that reduces reliance on frontier models while improving fine-grained signal quality.

major comments (2)
  1. [Abstract] Abstract: the central claim of 'strong competitive performance' on 6 benchmarks with only 1K–3K instances is stated without any metrics, baselines, error bars, ablation results, or experimental protocol, rendering the primary empirical contribution impossible to assess or replicate from the supplied text.
  2. [Abstract] Abstract (Stage I description): the assumption that iterative multi-turn agentic search 'reliably discovers and synthesizes the task-specific, knowledge-intensive dimensions and failure modes' is presented as an observed outcome without any validation, human evaluation, or failure-case analysis of the search process itself.
minor comments (1)
  1. [Abstract] The abstract refers to 'GRPO-based policy optimization' and 'bootstrap rubric generation' without defining the acronyms or the precise optimization objective on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below, clarifying that the full manuscript supplies the requested experimental details while agreeing to strengthen the abstract and add limitations discussion where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'strong competitive performance' on 6 benchmarks with only 1K–3K instances is stated without any metrics, baselines, error bars, ablation results, or experimental protocol, rendering the primary empirical contribution impossible to assess or replicate from the supplied text.

    Authors: The abstract is a concise summary constrained by length limits. The full manuscript provides all requested elements in Section 4 (Tables 1–3 report exact metrics, baselines including standard GRPO and prompt-generated rubrics, and comparisons across the 6 benchmarks) and Section 3 (full experimental protocol with 1K–3K instance counts and training details). Results include means and standard deviations from multiple runs. We will revise the abstract to include one or two representative quantitative results. revision: yes

  2. Referee: [Abstract] Abstract (Stage I description): the assumption that iterative multi-turn agentic search 'reliably discovers and synthesizes the task-specific, knowledge-intensive dimensions and failure modes' is presented as an observed outcome without any validation, human evaluation, or failure-case analysis of the search process itself.

    Authors: The abstract summarizes the observed outcome. The manuscript provides supporting evidence via progressive performance gains across bootstrap iterations (Section 4.3) and qualitative examples of generated atomic constraints in the appendix, which illustrate coverage of domain facts and failure modes. We agree that explicit human evaluation of the search process itself is absent and will add a limitations paragraph addressing this and potential failure cases. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents DR-rubric as an empirical two-stage engineering framework (agentic search followed by distillation into verifiable constraints) evaluated on benchmarks. No equations, fitted parameters, predictions of derived quantities, or load-bearing self-citations appear in the abstract or described claims. The bootstrap evolution is reported as an observed experimental outcome across iterations rather than a definitional identity or fitted input renamed as prediction. The central claim—that reframing rubric construction as research yields better rewards—rests on external benchmark results and does not reduce to its own inputs by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that multi-turn agentic search can surface the precise knowledge needed for atomic constraints; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption Rubric construction is itself a research problem that benefits from iterative external knowledge synthesis via agentic search.
    This premise underpins the entire two-stage design and is invoked in the abstract description of Stage I.

pith-pipeline@v0.9.1-grok · 5828 in / 1248 out tokens · 23934 ms · 2026-06-28T17:14:10.028940+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 26 canonical work pages · 16 internal anchors

  1. [1]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.CoRR, abs/2507.17746, 2025

  2. [2]

    Chasing the tail: Effective rubric- based reward modeling for large language model post-training.CoRR, abs/2509.21500, 2025

    Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, and Lifeng Jin. Chasing the tail: Effective rubric- based reward modeling for large language model post-training.CoRR, abs/2509.21500, 2025

  3. [3]

    Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general LLM reasoning

    Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, Wei Chen, Yan Xie, and Mingli Song. Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general LLM reasoning. CoRR, abs/2508.16949, 2025

  4. [4]

    Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health.CoRR, abs/2505.08775, 2025

  5. [5]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment.CoRR, abs/2510.07743, 2025

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment.CoRR, abs/2510.07743, 2025

  6. [6]

    Auto-rubric: Learning to extract generalizable criteria for reward modeling.CoRR, abs/2510.17314, 2025

    Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, and Bolin Ding. Auto-rubric: Learning to extract generalizable criteria for reward modeling.CoRR, abs/2510.17314, 2025

  7. [7]

    Preprint, arXiv:2508.12790

    Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, and Junbo Zhao. Reinforcement learning with rubric anchors.CoRR, abs/2508.12790, 2025

  8. [8]

    ACE-RL: adaptive constraint-enhanced reward for long-form generation reinforcement learning.CoRR, abs/2509.04903, 2025

    Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, and Jiajun Zhang. ACE-RL: adaptive constraint-enhanced reward for long-form generation reinforcement learning.CoRR, abs/2509.04903, 2025

  9. [9]

    DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

    Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David A. Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. DR tulu: Reinforcement learning with...

  10. [10]

    Online rubrics elicitation from pairwise comparisons.CoRR, abs/2510.07284, 2025

    MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, and Afra Feyza Akyürek. Online rubrics elicitation from pairwise comparisons.CoRR, abs/2510.07284, 2025

  11. [11]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, L...

  12. [12]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.CoRR, abs/2401.01335, 2024

  13. [13]

    Self-Rewarding Language Models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models.CoRR, abs/2401.10020, 2024. 12

  14. [14]

    Yifei, Allen Chang, Chaitanya Malaviya, and Mark Yatskar

    Li S. Yifei, Allen Chang, Chaitanya Malaviya, and Mark Yatskar. Researchqa: Evaluating scholarly question answering at scale across 75 fields with survey-mined questions and rubrics. CoRR, abs/2509.00496, 2025

  15. [15]

    DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents.CoRR, abs/2506.11763, 2025

  16. [16]

    LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

    Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, and Ting Su. Localsearchbench: Benchmarking agentic search in real-world local life services.CoRR, abs/2512.07436, 2025

  17. [17]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark.CoRR, abs/2311.12022, 2023

  18. [18]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela ...

  19. [19]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021

  20. [20]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.CoRR, abs/2505.09388, 2025

  21. [21]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Heng Ji, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.CoRR, abs/2503.09516, 2025

  22. [22]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning

    Qingfei Zhao, Ruobing Wang, Yukun Yan, Ruihua Song, Zhichao Duan, Renjun Hu, Xinyu Cao, Ying Chen, Li Ma, Shu Li, Yong Zhang, Mingde Shao, Zhiyuan Liu, and Maosong Sun. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. CoRR, abs/2411.02337, 2024

  23. [23]

    OpenAI GPT-5 System Card

    OpenAI. Openai GPT-5 system card.CoRR, abs/2601.03267, 2026

  24. [24]

    Gemini 3.1 pro model card

    Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, February 2026. Accessed: 2026-05-30

  25. [25]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.CoRR, abs/2501.12948, 2025

  26. [26]

    WebThinker: Empowering Large Reasoning Models with Deep Research Capability

    Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, Zheng Liu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. CoRR, abs/2504.21776, 2025

  27. [27]

    Tongyi DeepResearch Technical Report

    Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayi...

  28. [28]

    DeerFlow: Deep exploration and efficient research flow

    Daniel Walnut, Henry Li, and ByteDance Inc. DeerFlow: Deep exploration and efficient research flow. https://github.com/bytedance/deer-flow, 2025. Open-source multi- agent framework for deep research automation

  29. [29]

    Narasimhan, and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

  30. [30]

    OpenReview.net, 2023

  31. [31]

    Ministral 3

    Mistral AI. Ministral 3.CoRR, abs/2601.08584, 2026

  32. [32]

    Ground Truth

    S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y . Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y . Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y . Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zha...

  33. [33]

    15 System Prompt: Stage II (Rubric Synthesis) # Role Definition You are an expert in evaluation framework design for academic research

    Key Dimensions of Quality:(What specific attributes—e.g., conciseness, creativity, coding style—matter most forthisspecific query?) 4.Edge Cases & Constraints:(What subtle details must be correct?) Action:Begin your deep research into{{ query }}now. 15 System Prompt: Stage II (Rubric Synthesis) # Role Definition You are an expert in evaluation framework d...

  34. [34]

    This report contains the necessary factual information (algorithms, parameters, benchmarks, etc.) that a high-quality responseshouldcontain

    Analyze the Final Report:Use the provided {{ final_report }} as theground truth. This report contains the necessary factual information (algorithms, parameters, benchmarks, etc.) that a high-quality responseshouldcontain

  35. [35]

    Prohibited Content:

    Design the Framework:Construct evaluation criteria where the specific metrics, names, and thresholds are derived from the facts found in the{{ final_report }}. Prohibited Content:

  36. [36]

    Direct answers to the query or summaries of the final report

  37. [37]

    Recommendations, tutorials, or explanatory content

  38. [38]

    >=[X] specific algorithms included

    Generic criteria unrelated to the specific facts in the final report. # Evaluation Framework Design Requirements 1.Content-Centric & Fact-Based Evaluation: • Core Principle:Focus oninformation qualityandfactual completenessbased on the{{ final_report }}. • Calibration:Do not set unrealistic thresholds. Align thresholds with the actual information landscap...

  39. [39]

    #### Core Dimension:

    Framework Generation:Assess response quality based on the facts provided in {{ final_report }}. 2.Dimension Design:3–8 dimensions, Core Dimension first. 3.Core Dimension: • Explicitly labeled (e.g., “#### Core Dimension: . . . ”). 16 • Grounded in Fact:Check for the specific entities (methods, papers, data) present in the{{ final_report }}. 4.Strict Thres...

  40. [40]

    Stop ifP n >max(0.15,2×P n−1)

    Primary criterion:After bootstrap step n, compute polarization Pn (fraction of samples scoring 0 or≥0.99). Stop ifP n >max(0.15,2×P n−1). 2.Model selection:arg min k Pk (lowest polarization)

  41. [41]

    Auxiliary:If min(entropy)<0.70 within a step, the rubrics are too narrow—consider reducing constraint specificity

  42. [42]

    Retrospective validation on both scales: this rule correctly selects BS-2 and terminates at BS-3, avoiding BS-4/BS-5 and saving 40–60% of total bootstrap compute

    Hard bound:Do not exceed 3 bootstrap iterations without external rubric grounding (e.g., a GPT-5 “reset” step). Retrospective validation on both scales: this rule correctly selects BS-2 and terminates at BS-3, avoiding BS-4/BS-5 and saving 40–60% of total bootstrap compute. J Fresh-Start Bootstrap: Disentangling Rubric Quality from Policy Drift The cumula...