RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Pith reviewed 2026-05-12 03:45 UTC · model grok-4.3
The pith
Rubrics act as the shared interface for structuring, judging, and remembering research trajectories in reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RubricEM first decomposes research trajectories into stage-aware policies by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then applies Stage-Structured GRPO, which uses rubric-based judgments at each stage to deliver denser credit. In parallel, it trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable, rubric-grounded guidance. The resulting 8B agent posts strong results on long-form research benchmarks, and accompanying analyses isolate the contribution of each component.
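The four-stage, rubric-conditioned execution described above can be sketched as a simple episode loop. Everything here is hypothetical: the stage names come from the claim, but `run_episode`, the `policy`/`judge` call signatures, and the data shapes are illustrative stand-ins, not RubricEM's actual API.

```python
from dataclasses import dataclass, field

STAGES = ["planning", "evidence_gathering", "review", "synthesis"]

@dataclass
class Trajectory:
    rubric: str = ""
    segments: dict = field(default_factory=dict)

def run_episode(policy, judge, query):
    """Condition each stage on a self-generated rubric; judge each stage."""
    traj = Trajectory()
    # The agent first writes its own grading rubric for the query.
    traj.rubric = policy(stage="rubric", query=query, context="")
    context = ""
    rewards = {}
    for stage in STAGES:
        # Each stage is a shorter, credit-assignable segment conditioned
        # on the rubric and on what earlier stages produced.
        out = policy(stage=stage, query=query, context=context + traj.rubric)
        traj.segments[stage] = out
        context += out
        # A stagewise rubric judgment supplies dense semantic feedback
        # without waiting for the final report.
        rewards[stage] = judge(rubric=traj.rubric, stage=stage, output=out)
    return traj, rewards
```

The point of the sketch is the shape of the interface: the same rubric string flows into every stage's context and into every stage's judgment, which is what "rubrics as the shared interface" amounts to operationally.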
What carries the argument
Rubric-guided policy decomposition paired with reflection-based meta-policy evolution, where rubrics serve as the interface for execution structure, semantic feedback, and reusable memory.
If this is right
- Stage-aware conditioning on self-generated rubrics turns long-horizon research trajectories into shorter, credit-assignable segments.
- Stage-Structured GRPO supplies semantic feedback at each stage instead of waiting for a final unverifiable answer.
- The reflection meta-policy converts judged experience into reusable rubric-grounded instructions that improve future attempts.
- An 8B model trained this way outperforms comparable open models and approaches proprietary deep-research systems on four benchmarks.
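The abstract does not spell out the Stage-Structured GRPO objective. One plausible reading of "denser credit" is GRPO's group-relative normalization applied per stage rather than once per trajectory; the function below is a minimal sketch under that assumption, with hypothetical names throughout.

```python
import statistics

def stagewise_group_advantages(stage_scores):
    """Group-relative advantages per stage (GRPO-style normalization).

    stage_scores: one dict per rollout in the group, mapping
    stage name -> judged rubric score. Returns one advantage per
    rollout per stage, so credit lands on the stage that earned it
    rather than on a single end-of-trajectory reward.
    """
    stages = stage_scores[0].keys()
    advantages = [dict() for _ in stage_scores]
    for stage in stages:
        scores = [s[stage] for s in stage_scores]
        mean = statistics.fmean(scores)
        std = statistics.pstdev(scores)
        for adv, score in zip(advantages, scores):
            # Guard against a degenerate group where every rollout ties.
            adv[stage] = (score - mean) / std if std > 1e-8 else 0.0
    return advantages
```

Under this reading, a rollout with a strong plan but a weak synthesis receives positive advantage on the planning segment and negative advantage on the synthesis segment, which is exactly the credit-assignment granularity a single final score cannot provide.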
Where Pith is reading between the lines
- The same rubric interface could be applied to other agentic domains such as code debugging or scientific hypothesis generation where rewards are also subjective.
- If rubric quality scales with model size, larger backbones might close the remaining gap to proprietary systems without additional human labeling.
- The framework suggests that memory in long-horizon agents can be made explicit and queryable by storing rubric-grounded reflections rather than raw trajectories.
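The "explicit and queryable" memory reading in the last bullet can be made concrete as a toy reflection bank: rubric-grounded guidance is stored instead of raw trajectories and retrieved by similarity to a new query. This is purely a sketch; the class name is invented, and crude token overlap stands in for the semantic retrieval a real system would use.

```python
class ReflectionBank:
    """Toy rubric-grounded memory: store lessons, not trajectories."""

    def __init__(self):
        self.entries = []  # (rubric term set, guidance) pairs

    def insert(self, rubric: str, guidance: str):
        self.entries.append((set(rubric.lower().split()), guidance))

    def retrieve(self, query: str, k: int = 1):
        """Return the k reflections whose rubric best overlaps the query."""
        terms = set(query.lower().split())
        ranked = sorted(
            self.entries,
            key=lambda entry: len(entry[0] & terms),
            reverse=True,
        )
        return [guidance for _, guidance in ranked[:k]]
```

Because each entry is keyed by rubric content rather than by an opaque trajectory, the memory is inspectable: one can audit which past lesson was injected into a new attempt and why it matched.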
Load-bearing premise
Self-generated rubrics can reliably structure policy execution, deliver effective stagewise semantic feedback, and support reusable guidance through reflection for trajectories that lack ground-truth answers.
What would settle it
A controlled ablation on the same four benchmarks would settle it: if removing either the stagewise rubric judgments or the reflection meta-policy leaves performance no better than a standard GRPO baseline, the central claim is falsified.
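The decision rule implied by this falsification test can be written down directly. The ablation arm names and the verdict labels below are ours, not the paper's; scores stand for mean benchmark performance per arm.

```python
def verdict(scores: dict) -> str:
    """Apply the falsification test to ablation results.

    scores maps ablation arm -> mean benchmark score:
      'grpo_baseline'       standard GRPO, no rubric machinery
      'no_stage_judgments'  reflection meta-policy only
      'no_reflection'       stagewise rubric judgments only
      'full'                complete RubricEM
    """
    base = scores["grpo_baseline"]
    if scores["full"] <= base:
        return "falsified"  # the full method adds nothing over plain GRPO
    if scores["no_stage_judgments"] >= scores["full"] and \
       scores["no_reflection"] >= scores["full"]:
        return "components not load-bearing"  # removing either costs nothing
    return "supported"
```

The middle branch matters: strong end-to-end numbers alone do not vindicate the rubric machinery if neither component's removal hurts.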
Original abstract
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RubricEM, a rubric-guided meta-RL framework for deep research agents that plan, search, evaluate evidence, and synthesize long-form reports. It decomposes trajectories into stages (planning, evidence gathering, review, synthesis) conditioned on self-generated rubrics, assigns credit via Stage-Structured GRPO using stagewise rubric judgments for denser semantic feedback, and trains a shared-backbone reflection meta-policy to distill judged trajectories into reusable rubric-grounded guidance. The resulting RubricEM-8B is claimed to achieve strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems, with additional analyses of key ingredients.
Significance. If the empirical results hold, this work would be significant for extending RL to non-verifiable reward regimes in complex agentic tasks. By treating rubrics as a shared interface for policy execution, feedback, and memory, it provides a concrete mechanism for turning long-horizon research attempts into reusable experience, which is a notable gap in current post-training methods. The combination of stagewise decomposition and meta-policy evolution could influence designs for LLM agents handling open-ended, multi-step reasoning.
Major comments (2)
- Abstract: The manuscript states strong performance claims for RubricEM-8B but provides no experimental details, baselines, metrics, benchmark names, or quantitative results. This prevents any verification of whether the data support the central claim of outperforming open models and approaching proprietary systems.
- Framework section (Stage-Structured GRPO and meta-policy): The description of how stagewise rubric judgments provide denser feedback and how the reflection meta-policy distills trajectories is high-level only, with no equations, pseudocode, or loss formulations. Without these, it is impossible to assess whether the approach is internally consistent or novel relative to standard GRPO/PPO variants.
Minor comments (1)
- The acronym GRPO is not expanded or defined on first use.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the potential impact of our work on RL for complex agentic tasks. We address each major comment below and will incorporate revisions to enhance clarity and verifiability.
Point-by-point responses
- Referee: Abstract: The manuscript states strong performance claims for RubricEM-8B but provides no experimental details, baselines, metrics, benchmark names, or quantitative results. This prevents any verification of whether the data support the central claim of outperforming open models and approaching proprietary systems.
Authors: We acknowledge that the abstract presents the performance claims at a high level without specific details. The manuscript provides these in Section 4, including benchmark names, baselines, metrics such as report quality scores, and quantitative comparisons showing outperformance over open models and competitiveness with proprietary systems. We will revise the abstract to include the benchmark names and a summary of the results to facilitate verification. Revision planned: yes.
- Referee: Framework section (Stage-Structured GRPO and meta-policy): The description of how stagewise rubric judgments provide denser feedback and how the reflection meta-policy distills trajectories is high-level only, with no equations, pseudocode, or loss formulations. Without these, it is impossible to assess whether the approach is internally consistent or novel relative to standard GRPO/PPO variants.
Authors: We agree that additional formalization would strengthen the presentation. The full paper includes algorithmic descriptions, but to address this, we will add explicit equations for the stage-structured advantage estimation and the meta-policy distillation loss, along with pseudocode for the RubricEM training loop, in the revised Framework section. Revision planned: yes.
Circularity Check
No significant circularity; framework is self-contained
Full rationale
The provided manuscript text consists of an abstract and high-level description of RubricEM without any equations, derivations, fitted parameters, or load-bearing self-citations. The framework is introduced as a novel combination of stagewise policy decomposition, Stage-Structured GRPO, and reflection-based meta-policy evolution, with performance evaluated on external benchmarks. No step reduces by construction to its inputs, and the central claims rest on empirical results rather than self-referential definitions or ansatzes.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Self-generated rubrics can serve as a shared interface for structuring policy execution, judge feedback, and agent memory in research tasks without ground-truth answers.