pith. machine review for the scientific record.

arxiv: 2605.10899 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.LG

Recognition: no theorem link

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:45 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords meta reinforcement learning · rubric-guided policy · policy decomposition · research agents · long-form synthesis · stagewise credit assignment · reflection meta-policy

The pith

Rubrics act as the shared interface for structuring, judging, and remembering research trajectories in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard reinforcement learning struggles with deep research agents because their outputs lack ground-truth answers and their long tool-using trajectories offer little reusable signal. RubricEM instead treats self-generated rubrics as the common language that conditions every stage of work, supplies stage-by-stage semantic judgments, and feeds a reflection meta-policy that turns past attempts into guidance for new ones. This decomposition plus meta-evolution produces an 8B model that outperforms open baselines and nears proprietary systems on four long-form research benchmarks. The approach therefore extends meta-RL past verifiable-reward settings into open-ended synthesis tasks.

Core claim

RubricEM first decomposes research trajectories into stage-aware policies by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then applies Stage-Structured GRPO to deliver denser credit via rubric-based judgments at each stage. In parallel it trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable, rubric-grounded guidance. The resulting 8B agent shows strong results on long-form research benchmarks, and the paper analyzes the contribution of each component.
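
The referee notes below that the paper provides no equations for Stage-Structured GRPO, so the following is a minimal editorial sketch of one plausible reading: GRPO's group-relative normalization applied per stage to rubric-judged scores. The four stage names come from the abstract; the score dictionaries, the tie guard, and the function name stage_advantages are assumptions.

import statistics

STAGES = ["planning", "evidence", "review", "synthesis"]  # stages named in the abstract

def stage_advantages(group_scores):
    """Group-relative advantage per stage, GRPO-style (editorial sketch).

    group_scores: one dict per rollout in the sampled group, mapping
    stage name -> rubric-judged score in [0, 1]. Returns, per rollout,
    a dict mapping stage -> advantage, so each stage segment receives
    its own normalized credit instead of one sparse end-of-trajectory
    reward.
    """
    advantages = [dict() for _ in group_scores]
    for stage in STAGES:
        scores = [g[stage] for g in group_scores]
        mu = statistics.mean(scores)
        sigma = statistics.pstdev(scores) or 1.0  # guard: zero spread when judges tie
        for i, s in enumerate(scores):
            advantages[i][stage] = (s - mu) / sigma
    return advantages

# Three rollouts judged per stage: the weak plan in rollout 2 is penalized
# only on its planning segment, not across the whole trajectory.
group = [
    {"planning": 0.9, "evidence": 0.6, "review": 0.7, "synthesis": 0.5},
    {"planning": 0.3, "evidence": 0.8, "review": 0.7, "synthesis": 0.6},
    {"planning": 0.6, "evidence": 0.7, "review": 0.7, "synthesis": 0.9},
]
print(stage_advantages(group))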

What carries the argument

Rubric-guided policy decomposition paired with reflection-based meta-policy evolution, where rubrics serve as the interface for execution structure, semantic feedback, and reusable memory.

If this is right

  • Stage-aware conditioning on self-generated rubrics turns long-horizon research trajectories into shorter, credit-assignable segments.
  • Stage-Structured GRPO supplies semantic feedback at each stage instead of waiting for a final unverifiable answer.
  • The reflection meta-policy converts judged experience into reusable rubric-grounded instructions that improve future attempts (a minimal sketch follows this list).
  • An 8B model trained this way outperforms comparable open models and approaches proprietary deep-research systems on four benchmarks.
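
The reflection loop is likewise described only in prose. Below is a minimal sketch, assuming a judge-gated memory of rubric-grounded reflections retrieved by similarity; the acceptance threshold, the lexical-overlap retrieval, and the class name RubricBank are stand-ins, not the paper's mechanism.

class RubricBank:
    """Judge-gated memory of rubric-grounded reflections (editorial sketch)."""

    def __init__(self, accept_threshold=0.5):
        self.items = []  # (query, reflection) pairs that passed the judge gate
        self.accept_threshold = accept_threshold

    def add(self, query, reflection, judged_utility):
        # Store a reflection only if the judge scores it as useful guidance.
        if judged_utility >= self.accept_threshold:
            self.items.append((query, reflection))

    def retrieve(self, query, k=2):
        # Lexical word overlap stands in for whatever retrieval the paper uses;
        # the top-k reflections would be injected into the next attempt's prompt.
        words = set(query.lower().split())
        ranked = sorted(
            self.items,
            key=lambda item: len(words & set(item[0].lower().split())),
            reverse=True,
        )
        return [reflection for _, reflection in ranked[:k]]

bank = RubricBank()
bank.add("survey reward hacking in RLHF", "Check every rubric item against a primary source.", 0.8)
bank.add("compare vector databases", "Lead with trade-offs, not a verdict.", 0.3)  # gated out
print(bank.retrieve("recent reward hacking results for RLHF agents"))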

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rubric interface could be applied to other agentic domains such as code debugging or scientific hypothesis generation where rewards are also subjective.
  • If rubric quality scales with model size, larger backbones might close the remaining gap to proprietary systems without additional human labeling.
  • The framework suggests that memory in long-horizon agents can be made explicit and queryable by storing rubric-grounded reflections rather than raw trajectories.

Load-bearing premise

Self-generated rubrics can reliably structure policy execution, deliver effective stagewise semantic feedback, and support reusable guidance through reflection for trajectories that lack ground-truth answers.
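
If this premise holds, a single rubric object has to play three roles at once: condition execution, grade stage outputs, and anchor stored reflections. A minimal sketch of that interface follows; every field and method name is an assumption, and the substring-matching judge is a toy stand-in for the paper's judge model.

from dataclasses import dataclass

@dataclass
class Rubric:
    """One object, three roles (editorial sketch): it conditions the policy,
    grades stage outputs, and is what stored reflections are grounded in."""
    knowledge_checklist: list[str]   # facts the answer must contain
    synthesis_criteria: list[str]    # analytical requirements
    negative_constraints: list[str]  # things the answer must avoid

    def as_prompt(self) -> str:
        # Role 1: execution structure, prepended to each stage's context.
        return "Satisfy: " + "; ".join(self.knowledge_checklist + self.synthesis_criteria)

    def judge(self, stage_output: str) -> float:
        # Role 2: stagewise semantic feedback, here a toy coverage fraction.
        items = self.knowledge_checklist + self.synthesis_criteria
        hits = sum(1 for item in items if item.lower() in stage_output.lower())
        return hits / max(len(items), 1)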

What would settle it

A controlled ablation on the same four benchmarks in which removing either the stagewise rubric judgments or the reflection meta-policy produces no gain or a clear drop relative to a standard GRPO baseline would falsify the central claim.
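
Concretely, the settling experiment is a four-cell ablation grid. A sketch assuming hypothetical train and evaluate callables; the central claim fails if "full" does not clearly beat "grpo_baseline" under the same benchmarks.

# Hypothetical harness for the settling experiment: same benchmarks,
# one component removed at a time. `train` and `evaluate` are assumed.
VARIANTS = {
    "full":             dict(stage_rewards=True,  reflection=True),
    "no_stage_rewards": dict(stage_rewards=False, reflection=True),
    "no_reflection":    dict(stage_rewards=True,  reflection=False),
    "grpo_baseline":    dict(stage_rewards=False, reflection=False),
}

def run_ablation(train, evaluate, benchmarks):
    scores = {}
    for name, config in VARIANTS.items():
        agent = train(**config)
        scores[name] = {b: evaluate(agent, b) for b in benchmarks}
    return scores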

read the original abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RubricEM, a rubric-guided meta-RL framework for deep research agents that plan, search, evaluate evidence, and synthesize long-form reports. It decomposes trajectories into stages (planning, evidence gathering, review, synthesis) conditioned on self-generated rubrics, assigns credit via Stage-Structured GRPO using stagewise rubric judgments for denser semantic feedback, and trains a shared-backbone reflection meta-policy to distill judged trajectories into reusable rubric-grounded guidance. The resulting RubricEM-8B is claimed to achieve strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems, with additional analyses of key ingredients.

Significance. If the empirical results hold, this work would be significant for extending RL to non-verifiable reward regimes in complex agentic tasks. By treating rubrics as a shared interface for policy execution, feedback, and memory, it provides a concrete mechanism for turning long-horizon research attempts into reusable experience, which is a notable gap in current post-training methods. The combination of stagewise decomposition and meta-policy evolution could influence designs for LLM agents handling open-ended, multi-step reasoning.

major comments (2)
  1. Abstract: The manuscript states strong performance claims for RubricEM-8B but provides no experimental details, baselines, metrics, benchmark names, or quantitative results. This prevents any verification of whether the data support the central claim of outperforming open models and approaching proprietary systems.
  2. Framework section (Stage-Structured GRPO and meta-policy): The description of how stagewise rubric judgments provide denser feedback and how the reflection meta-policy distills trajectories is high-level only, with no equations, pseudocode, or loss formulations. Without these, it is impossible to assess whether the approach is internally consistent or novel relative to standard GRPO/PPO variants.
minor comments (1)
  1. The acronym GRPO is used without expansion or prior definition on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and for highlighting the potential impact of our work on RL for complex agentic tasks. We address each major comment below and will incorporate revisions to enhance clarity and verifiability.

read point-by-point responses
  1. Referee: Abstract: The manuscript states strong performance claims for RubricEM-8B but provides no experimental details, baselines, metrics, benchmark names, or quantitative results. This prevents any verification of whether the data support the central claim of outperforming open models and approaching proprietary systems.

    Authors: We acknowledge that the abstract presents the performance claims at a high level without specific details. The manuscript provides these in Section 4, including benchmark names, baselines, metrics such as report quality scores, and quantitative comparisons showing outperformance over open models and competitiveness with proprietary systems. We will revise the abstract to include the benchmark names and a summary of the results to facilitate verification. revision: yes

  2. Referee: Framework section (Stage-Structured GRPO and meta-policy): The description of how stagewise rubric judgments provide denser feedback and how the reflection meta-policy distills trajectories is high-level only, with no equations, pseudocode, or loss formulations. Without these, it is impossible to assess whether the approach is internally consistent or novel relative to standard GRPO/PPO variants.

    Authors: We agree that additional formalization would strengthen the presentation. The full paper includes algorithmic descriptions, but to address this, we will add explicit equations for the stage-structured advantage estimation and the meta-policy distillation loss, along with pseudocode for the RubricEM training loop, in the revised Framework section. revision: yes
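
Pending that revision, here is one plausible shape such equations could take, offered as an editorial placeholder rather than the authors' formulation (the stagewise score R_z, judge gate A, utility score Δ, and group size G are all assumed):

A_{i,z} = \frac{R_z(\tau_i) - \frac{1}{G}\sum_{j=1}^{G} R_z(\tau_j)}{\operatorname{std}_j R_z(\tau_j)},
\qquad
\nabla_\theta J_{\text{meta}} = \mathbb{E}\big[\, A(\tilde{q}, \tilde{\tau}, S)\, \Delta(\tilde{q}, \tilde{\tau}, S)\, \nabla_\theta \log r_\theta(S \mid \tilde{q}, \tilde{\tau}, \mathcal{M}) \,\big],

where R_z(\tau_i) is the rubric-judged score of rollout i at stage z within a group of size G, S is a reflection sampled from the shared backbone's reflection head r_\theta, and \mathcal{M} is the current rubric bank.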

Circularity Check

0 steps flagged

No significant circularity; framework is self-contained

full rationale

The provided manuscript text consists of an abstract and high-level description of RubricEM without any equations, derivations, fitted parameters, or load-bearing self-citations. The framework is introduced as a novel combination of stagewise policy decomposition, Stage-Structured GRPO, and reflection-based meta-policy evolution, with performance evaluated on external benchmarks. No step reduces by construction to its inputs, and the central claims rest on empirical results rather than self-referential definitions or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on the abstract only; no explicit free parameters or invented entities are detailed in the provided text, and a single domain assumption is recorded as an axiom.

axioms (1)
  • domain assumption: Self-generated rubrics can serve as a shared interface for structuring policy execution, judge feedback, and agent memory in research tasks without ground-truth answers.
    This premise underpins the entire RubricEM design as stated in the abstract.

pith-pipeline@v0.9.0 · 5578 in / 1205 out tokens · 144874 ms · 2026-05-12T03:45:34.127664+00:00 · methodology

discussion (0)

