arxiv: 2605.23590 · v1 · pith:TCMR2K5Vnew · submitted 2026-05-22 · 💻 cs.AI

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Pith reviewed 2026-05-25 04:16 UTC · model grok-4.3

classification 💻 cs.AI
keywords ReAct agents · step-level rubrics · GRPO training · list-wise ranking · search agents · multi-step reasoning · DeepResearchBench · SQA-CS-V2

The pith

Co-ReAct injects trained rubrics into ReAct agents at each decision step to guide evidence seeking, reasoning, and action selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReAct-style agents decide next steps using only internal judgment, which often yields shallow or redundant trajectories on search-heavy tasks. Co-ReAct supplies external rubrics that specify concrete targets for the next Reason-or-Act choice. These rubrics are produced by a generator trained with GRPO to maximize list-wise Spearman rank correlation against multi-judge expert rankings rather than binary preferences. The resulting guidance yields consistent gains over ReAct and test-time compute baselines on DeepResearchBench and SQA-CS-V2 for both 8B/14B open models and frontier closed models. The same generator can be dropped into other agent frameworks without altering their internal decision loops.
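
The step-level injection described above can be sketched as a loop that asks for a fresh rubric before every Reason-or-Act decision. This is a minimal illustration, not the authors' implementation: the summary does not give the prompt template or where the rubric sits relative to the history, so `build_prompt`, the `FINISH` convention, and the three callables (`generate_rubric`, `decide`, `execute`) are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    rubric: str
    decision: str
    observation: str


def build_prompt(question: str, history: List[Step], rubric: str) -> str:
    # Assumed layout: question, prior steps, then the rubric just before the decision slot.
    lines = [f"Question: {question}"]
    for i, step in enumerate(history, 1):
        lines.append(f"Step {i}: {step.decision}")
        lines.append(f"Observation {i}: {step.observation}")
    lines.append(f"Rubric for the next step: {rubric}")
    lines.append("Next Reason-or-Act decision:")
    return "\n".join(lines)


def co_react_loop(
    question: str,
    generate_rubric: Callable[[str, List[Step]], str],  # hypothetical rubric generator
    decide: Callable[[str], str],                       # hypothetical agent LLM call
    execute: Callable[[str], str],                      # hypothetical tool executor
    max_steps: int = 8,
) -> List[Step]:
    history: List[Step] = []
    for _ in range(max_steps):
        rubric = generate_rubric(question, history)     # one rubric per decision step
        decision = decide(build_prompt(question, history, rubric))
        if decision.startswith("FINISH"):               # assumed stop signal
            history.append(Step(rubric, decision, ""))
            break
        history.append(Step(rubric, decision, execute(decision)))
    return history


# Deterministic stubs so the sketch runs without a model or a search backend.
def _stub_rubric(question: str, history: List[Step]) -> str:
    return f"Target evidence item {len(history) + 1}"


def _stub_decide(prompt: str) -> str:
    return "FINISH" if "Observation 2" in prompt else "search[placeholder query]"


def _stub_execute(decision: str) -> str:
    return "stub observation"


trajectory = co_react_loop("What is DepthCrafter?", _stub_rubric, _stub_decide, _stub_execute)
```

The agent's own decision call is untouched in this sketch; only the context it receives changes, which is the property the drop-in claim depends on.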

Core claim

Co-ReAct treats rubrics as active step-level collaborators that are injected into the agent's context to direct the next decision, with the generator optimized via GRPO on a list-wise Spearman reward that aligns with expert consensus rankings; this produces more discriminative guidance than prior evaluative uses of rubrics and transfers across tasks and base models.
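
For readers unfamiliar with GRPO, the core idea is that each sampled output is scored relative to the other outputs sampled for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative normalisation. It assumes each candidate rubric has already received a scalar reward (here, a rank-correlation score); the function name, the `eps` term, and the use of the population standard deviation are illustrative choices, not details taken from the paper.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rubrics sampled for the same decision step, each with a reward in [-1, 1].
advantages = group_relative_advantages([0.9, 0.1, 0.5, 0.5])
```

Rubrics that rank candidates better than their siblings get a positive advantage and are reinforced; a group in which every rubric earns the same reward yields all-zero advantages and therefore no learning signal.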

What carries the argument

Rubric generator trained with GRPO on list-wise Spearman rank-correlation reward against multi-judge consensus rankings, injected at each agent step to specify targets for evidence, search, reasoning or self-evaluation.
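
A list-wise Spearman reward of the kind named here can be sketched as follows. The sketch assumes that a rubric is turned into one numeric score per candidate next action and that the multi-judge consensus is available as one numeric score (or rank) per candidate; how the paper derives either list, and how it treats ties, is not stated in this summary. Ties are given average ranks and a constant list is mapped to a reward of 0, both of which are choices made for this illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank ascending (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_reward(rubric_scores: Sequence[float], consensus_scores: Sequence[float]) -> float:
    """Spearman rank correlation, computed as Pearson correlation of average ranks."""
    if len(rubric_scores) != len(consensus_scores):
        raise ValueError("both lists must score the same candidates")
    n = len(rubric_scores)
    if n < 2:
        return 0.0
    rx, ry = average_ranks(rubric_scores), average_ranks(consensus_scores)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a rubric that cannot separate candidates earns nothing (assumed convention)
    return cov / (vx * vy) ** 0.5
```

Under this convention a rubric that scores every candidate identically receives zero reward however reasonable it reads, which is one way to make concrete the contrast between rubrics that are discriminative and rubrics that are merely plausible.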

If this is right

  • Agents generate less redundant and more targeted trajectories on search-intensive tasks.
  • Performance gains appear across 8B, 14B open-source and frontier closed-source models on DeepResearchBench and SQA-CS-V2.
  • The rubric generator improves other test-time compute baselines as a drop-in module without altering their decision mechanisms.
  • Rubrics move from post-hoc evaluation to real-time action guidance.
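
The drop-in claim in the third bullet can be pictured as a wrapper around any policy that maps a context string to an output. The sketch below uses Best-of-N as the host method; `with_rubric`, `best_of_n`, and the stub functions are hypothetical names for illustration, and the way the rubric is appended to the context is an assumption rather than the paper's template.

```python
from typing import Callable, List

Policy = Callable[[str], str]
RubricGenerator = Callable[[str], str]


def with_rubric(policy: Policy, generate_rubric: RubricGenerator) -> Policy:
    """Return a policy that sees the same context plus a freshly generated rubric."""
    def guided(context: str) -> str:
        rubric = generate_rubric(context)
        return policy(f"{context}\n\nRubric for the next step:\n{rubric}")
    return guided


def best_of_n(policy: Policy, score: Callable[[str], float], context: str, n: int = 4) -> str:
    """Plain Best-of-N: sample n candidates and keep the highest-scoring one."""
    candidates = [policy(context) for _ in range(n)]
    return max(candidates, key=score)


# Deterministic stubs so the sketch runs without a model.
seen_contexts: List[str] = []


def stub_policy(context: str) -> str:
    seen_contexts.append(context)
    return f"candidate-{len(seen_contexts)}"


def stub_rubric(context: str) -> str:
    return "Prefer primary sources."


def stub_score(candidate: str) -> float:
    return float(candidate.split("-")[1])


choice = best_of_n(with_rubric(stub_policy, stub_rubric), stub_score, "Q: placeholder", n=3)
```

`best_of_n` itself is unchanged: its sampling and selection logic are the same with or without the wrapper, which is the sense in which the rubric generator leaves the host method's decision mechanism alone.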

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same list-wise ranking objective could be applied to generate guidance signals for agent frameworks other than ReAct.
  • Step-level rubric injection may reduce the need for extensive prompt engineering or additional test-time sampling.
  • Dynamic regeneration of rubrics mid-trajectory based on observed progress could further tighten guidance.
  • The approach separates the production of quality signals from the agent's policy, allowing independent scaling of the rubric model.

Load-bearing premise

The trained rubric generator produces reliable, discriminative step-level guidance that transfers to new tasks and base models without requiring changes to the agent's core decision loop.

What would settle it

Running the trained rubric generator on a new multi-step search benchmark or different base model and finding no improvement over the corresponding ReAct baseline would falsify the transfer claim.
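
One way to make "no improvement" operational is a paired comparison over per-question scores, with a bootstrap interval on the mean difference. This is a generic statistical sketch, not the paper's evaluation protocol: the function name, the resample count, and the percentile interval are assumptions, and any scores passed in would have to come from an actual run on the new benchmark.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    treated: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for treated minus baseline."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need one score per question for both systems")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the interval for Co-ReAct minus ReAct on the new benchmark or base model includes zero, the run gives no evidence of transfer; an interval lying entirely above zero would support it.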

Figures

Figures reproduced from arXiv: 2605.23590 by Bowen Zhang, Da Zhu, Guanjun Jiang, Jiangwang Chen, Jiazheng Kang, Xiao Yang, Zixin Song.

Figure 1: Overview of Co-ReAct. (i) Collect: sample candidate next actions at each branching point and rank them …

Figure 2: DRB RACE sub-metric results with Gemini 3.1 Pro used as the search agent, answer generator, and rubric generator. Co-ReAct achieves the best score on every sub-metric. Dashed lines mark the ReAct baseline in each group.
… methods (Self-Refine, Best-of-N, CRITIC) fail to improve over ReAct on this strong model, suggesting that self-correction and resampling offer diminishing returns when the base agent is …

Figure 3: Plug-in rubric portability. The rubric trained …

Figure 4: The rubric–verify–retry mechanism on a single SQA-CS-V2 question about DepthCrafter ("What is DepthCrafter and how does it differ from prior monocular depth estimators?"). ReAct and Co-ReAct issue identical … The ReAct trajectory (a1, a2, a3) includes academic_search(query="DepthCrafter video depth", year="2024", limit=8), obs: 3 arXiv hits, top = DepthCrafter; and google_search(query="DepthCrafter paper CVPR 2025"), obs: project…
Original abstract

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Co-ReAct, a rubric-guided framework for ReAct-style agents on search-intensive tasks. At each step the agent receives a rubric (produced by a GRPO-trained generator) that supplies step-level targets for evidence seeking, reasoning, or self-evaluation. The generator is optimized with a list-wise Spearman rank-correlation objective against multi-judge expert consensus rankings rather than pairwise preferences. The paper reports consistent gains over ReAct and test-time compute baselines on DeepResearchBench and SQA-CS-V2 for both 8B/14B open-source and frontier closed-source models; the trained generator is also shown to improve other baselines as a drop-in module. Public code is released.

Significance. If the reported gains hold under the stated evaluation protocol, the work supplies a practical, model-agnostic mechanism for injecting step-level guidance into existing agent loops without retraining the base policy. Strengths include the explicit list-wise ranking training objective, cross-model (open and closed) evaluation, and public code release, all of which facilitate reproducibility and extension.

major comments (2)
  1. [Results] Results section (and any accompanying tables/figures): the abstract asserts 'consistent improvements' across two benchmarks and multiple model families, yet the supplied text contains no numerical deltas, standard deviations, or dataset statistics. Without these quantities it is impossible to judge effect size or statistical reliability of the central empirical claim.
  2. [§3.2] §3.2 (rubric injection): the description states that the rubric is 'injected into the agent's context' but does not specify the exact prompt template, placement relative to the ReAct history, or token budget. This detail is load-bearing for the claim that the generator functions as a drop-in component without altering core decision loops.
minor comments (2)
  1. [Method] The training objective is described as list-wise Spearman correlation; a brief equation or pseudocode would clarify how ties and list length are handled.
  2. [Figures/Tables] Figure captions and table headers should explicitly state the number of runs and whether error bars represent standard deviation or standard error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below.

Point-by-point responses
  1. Referee: [Results] Results section (and any accompanying tables/figures): the abstract asserts 'consistent improvements' across two benchmarks and multiple model families, yet the supplied text contains no numerical deltas, standard deviations, or dataset statistics. Without these quantities it is impossible to judge effect size or statistical reliability of the central empirical claim.

    Authors: We agree that explicit numerical deltas, standard deviations, and dataset statistics are necessary to fully substantiate the claims of consistent improvements. Although the results section contains performance tables, we will revise it to include these additional details (effect sizes, variability measures, and dataset statistics) for improved clarity and statistical transparency.

    revision: yes

  2. Referee: [§3.2] §3.2 (rubric injection): the description states that the rubric is 'injected into the agent's context' but does not specify the exact prompt template, placement relative to the ReAct history, or token budget. This detail is load-bearing for the claim that the generator functions as a drop-in component without altering core decision loops.

    Authors: We acknowledge that additional implementation details are required for reproducibility. In the revised manuscript we will expand §3.2 to include the exact prompt template, the precise placement of the rubric relative to the ReAct history, and the token budget used for injection.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an empirical method for training a rubric generator via GRPO, with the reward explicitly defined as list-wise Spearman rank-correlation against external multi-judge expert consensus rankings. No equations, first-principles derivations, or predictions appear that reduce by construction to the paper's own fitted values or self-citations. The central claims rest on benchmark improvements (DeepResearchBench, SQA-CS-V2) across multiple base models, with the training objective tied to independent external data rather than internal self-reference. This is a standard empirical RL setup with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or implementation details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5833 in / 919 out tokens · 20942 ms · 2026-05-25T04:16:10.476108+00:00 · methodology
