Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Pith reviewed 2026-05-25 04:16 UTC · model grok-4.3
The pith
Co-ReAct injects trained rubrics into ReAct agents at each decision step to guide evidence seeking, reasoning, and action selection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Co-ReAct treats rubrics as active step-level collaborators that are injected into the agent's context to direct the next decision, with the generator optimized via GRPO on a list-wise Spearman reward that aligns with expert consensus rankings; this produces more discriminative guidance than prior evaluative uses of rubrics and transfers across tasks and base models.
What carries the argument
A rubric generator trained with GRPO on a list-wise Spearman rank-correlation reward against multi-judge consensus rankings. Its rubrics are injected at each agent step to specify targets for evidence seeking, search, reasoning, or self-evaluation.
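The list-wise reward can be illustrated with a minimal sketch. It assumes that a generated rubric is used to score a list of candidate responses and that the reward is the Spearman correlation between those scores and the consensus scores. The function names, the average-rank treatment of ties, and the zero reward for a constant ranking are illustrative assumptions, not details confirmed by the paper.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_reward(rubric_scores: Sequence[float],
                    consensus_scores: Sequence[float]) -> float:
    """Pearson correlation of the two rank vectors (Spearman's rho)."""
    if len(rubric_scores) != len(consensus_scores) or len(rubric_scores) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    rx, ry = average_ranks(rubric_scores), average_ranks(consensus_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # Assumption: a rubric that cannot separate candidates earns no reward.
        return 0.0
    return cov / (var_x * var_y) ** 0.5
```

Under this sketch, a rubric that orders candidates exactly as the consensus does receives a reward of 1, a fully reversed ordering receives -1, and a rubric that assigns every candidate the same score receives 0, which is one way to favor discriminative rubrics over merely plausible ones.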
If this is right
- Agents generate less redundant and more targeted trajectories on search-intensive tasks.
- Performance gains appear across 8B and 14B open-source models and frontier closed-source models on DeepResearchBench and SQA-CS-V2.
- The rubric generator improves other test-time compute baselines as a drop-in module without altering their decision mechanisms.
- Rubrics move from post-hoc evaluation to real-time action guidance.
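The step-level injection described in these points can be sketched as a small loop. This is a hedged illustration only: `co_react_loop`, `generate_rubric`, `decide`, the `Finish` stop marker, and the context layout are hypothetical names and choices, and the paper's actual loop may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trajectory:
    question: str
    steps: list[str] = field(default_factory=list)


def co_react_loop(
    question: str,
    generate_rubric: Callable[[Trajectory], str],  # hypothetical rubric generator
    decide: Callable[[str], str],                  # hypothetical agent policy call
    max_steps: int = 8,
) -> Trajectory:
    """Run a ReAct-style loop with a fresh rubric injected before each decision."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        # The rubric is produced outside the agent, then added to its context.
        rubric = generate_rubric(traj)
        context = "\n".join(
            [f"Question: {question}", *traj.steps, f"Rubric: {rubric}"]
        )
        decision = decide(context)
        traj.steps.append(decision)
        # Assumption: the agent signals termination with a "Finish" prefix.
        if decision.startswith("Finish"):
            break
    return traj
```

Because the rubric generator is passed in as a plain callable, the agent's decision function stays unchanged, which mirrors the drop-in claim above.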
Where Pith is reading between the lines
- The same list-wise ranking objective could be applied to generate guidance signals for agent frameworks other than ReAct.
- Step-level rubric injection may reduce the need for extensive prompt engineering or additional test-time sampling.
- Dynamic regeneration of rubrics mid-trajectory based on observed progress could further tighten guidance.
- The approach separates the production of quality signals from the agent's policy, allowing independent scaling of the rubric model.
Load-bearing premise
The trained rubric generator produces reliable, discriminative step-level guidance that transfers to new tasks and base models without requiring changes to the agent's core decision loop.
What would settle it
Running the trained rubric generator on a new multi-step search benchmark or different base model and finding no improvement over the corresponding ReAct baseline would falsify the transfer claim.
Original abstract
ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.
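For readers unfamiliar with GRPO, the group-relative step can be sketched as follows: several rubrics are sampled for the same input, each receives a scalar reward (here, the rank-correlation reward), and rewards are normalized within the group. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this sketch; the paper's training configuration is not given in the text above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled rubric's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A rubric whose reward exceeds the group mean gets a positive advantage and is reinforced; a group in which all rubrics score identically yields zero advantages and therefore no learning signal.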
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Co-ReAct, a rubric-guided framework for ReAct-style agents on search-intensive tasks. At each step the agent receives a rubric (produced by a GRPO-trained generator) that supplies step-level targets for evidence seeking, reasoning, or self-evaluation. The generator is optimized with a list-wise Spearman rank-correlation objective against multi-judge expert consensus rankings rather than pairwise preferences. The paper reports consistent gains over ReAct and test-time compute baselines on DeepResearchBench and SQA-CS-V2 for both 8B/14B open-source and frontier closed-source models; the trained generator is also shown to improve other baselines as a drop-in module. Public code is released.
Significance. If the reported gains hold under the stated evaluation protocol, the work supplies a practical, model-agnostic mechanism for injecting step-level guidance into existing agent loops without retraining the base policy. Strengths include the explicit list-wise ranking training objective, cross-model (open and closed) evaluation, and public code release, all of which facilitate reproducibility and extension.
Major comments (2)
- [Results] Results section (and any accompanying tables/figures): the abstract asserts 'consistent improvements' across two benchmarks and multiple model families, yet the supplied text contains no numerical deltas, standard deviations, or dataset statistics. Without these quantities it is impossible to judge effect size or statistical reliability of the central empirical claim.
- [§3.2] §3.2 (rubric injection): the description states that the rubric is 'injected into the agent's context' but does not specify the exact prompt template, placement relative to the ReAct history, or token budget. This detail is load-bearing for the claim that the generator functions as a drop-in component without altering core decision loops.
Minor comments (2)
- [Method] The training objective is described as list-wise Spearman correlation; a brief equation or pseudocode would clarify how ties and list length are handled.
- [Figures/Tables] Figure captions and table headers should explicitly state the number of runs and whether error bars represent standard deviation or standard error.
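For reference, the standard definitions behind both minor comments are given below. They are textbook formulas, not equations taken from the manuscript: Spearman's coefficient computed on mid-ranks (which handles ties for any list length n), its simplified form when no ties occur, and the standard error over m runs.

```latex
% r_i, q_i: mid-ranks of item i under the two orderings
% (tied items share the mean of the positions they occupy)
\rho = \frac{\sum_{i=1}^{n} (r_i - \bar{r})(q_i - \bar{q})}
            {\sqrt{\sum_{i=1}^{n} (r_i - \bar{r})^2 \, \sum_{i=1}^{n} (q_i - \bar{q})^2}},
\qquad \bar{r} = \bar{q} = \frac{n+1}{2}

% With no ties this reduces to
\rho = 1 - \frac{6 \sum_{i=1}^{n} (r_i - q_i)^2}{n (n^2 - 1)}

% Error bars over m runs with sample standard deviation s
\mathrm{SE} = \frac{s}{\sqrt{m}}
```

The coefficient is undefined when either ranking is constant, so an implementation has to choose a fallback value for that case.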
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below.
- Referee: [Results] Results section (and any accompanying tables/figures): the abstract asserts 'consistent improvements' across two benchmarks and multiple model families, yet the supplied text contains no numerical deltas, standard deviations, or dataset statistics. Without these quantities it is impossible to judge effect size or statistical reliability of the central empirical claim.
  Authors: We agree that explicit numerical deltas, standard deviations, and dataset statistics are necessary to fully substantiate the claims of consistent improvements. Although the results section contains performance tables, we will revise it to include these additional details (effect sizes, variability measures, and dataset statistics) for improved clarity and statistical transparency.
  Revision: yes
- Referee: [§3.2] §3.2 (rubric injection): the description states that the rubric is 'injected into the agent's context' but does not specify the exact prompt template, placement relative to the ReAct history, or token budget. This detail is load-bearing for the claim that the generator functions as a drop-in component without altering core decision loops.
  Authors: We acknowledge that additional implementation details are required for reproducibility. In the revised manuscript we will expand §3.2 to include the exact prompt template, the precise placement of the rubric relative to the ReAct history, and the token budget used for injection.
  Revision: yes
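The three details requested for §3.2 (template, placement, budget) can be made concrete with a hypothetical prompt builder. Everything here is an assumption for illustration: the template wording, the `placement` options, the default budget of 64, and the use of whitespace-separated words as a crude stand-in for tokens. The paper's actual template is not reproduced in the text above.

```python
def build_step_prompt(
    question: str,
    history: list[str],
    rubric: str,
    rubric_budget: int = 64,           # assumed budget, counted in words
    placement: str = "after_history",  # or "before_history"
) -> str:
    """Assemble one step's context with the rubric placed relative to the history."""
    words = rubric.split()
    if len(words) > rubric_budget:
        # Truncate an over-long rubric and mark the cut explicitly.
        rubric = " ".join(words[:rubric_budget]) + " ..."
    rubric_block = f"[Rubric for this step]\n{rubric}"
    history_block = "\n".join(history)

    parts = [f"Question: {question}"]
    if placement == "before_history":
        parts += [rubric_block, history_block]
    elif placement == "after_history":
        parts += [history_block, rubric_block]
    else:
        raise ValueError(f"unknown placement: {placement!r}")
    parts.append("Next Thought or Action:")
    # Skip empty blocks (for example, an empty history on the first step).
    return "\n\n".join(p for p in parts if p)
```

Reporting these three choices matters for the drop-in claim, since placing the rubric immediately before the decision cue and placing it before a long history may not affect the agent in the same way.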
Circularity Check
No significant circularity in derivation chain
The paper describes an empirical method for training a rubric generator via GRPO, with the reward explicitly defined as list-wise Spearman rank-correlation against external multi-judge expert consensus rankings. No equations, first-principles derivations, or predictions appear that reduce by construction to the paper's own fitted values or self-citations. The central claims rest on benchmark improvements (DeepResearchBench, SQA-CS-V2) across multiple base models, with the training objective tied to independent external data rather than internal self-reference. This is a standard empirical RL setup with no load-bearing self-citation chains or ansatz smuggling.
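To make the notion of an external consensus ranking concrete, here is a minimal sketch of one common aggregation rule, the Borda count. The text above does not say how the multi-judge consensus is formed, so the choice of Borda, the function name, and the alphabetical tie-break are all assumptions.

```python
def borda_consensus(judge_rankings: list[list[str]]) -> list[str]:
    """Merge complete best-first rankings from several judges into one ranking."""
    candidates = judge_rankings[0]
    n = len(candidates)
    points = {c: 0 for c in candidates}
    for ranking in judge_rankings:
        if sorted(ranking) != sorted(candidates):
            raise ValueError("every judge must rank the same set of candidates")
        for position, cand in enumerate(ranking):
            # Top position earns n - 1 points, last position earns 0.
            points[cand] += n - 1 - position
    # Assumption: ties in total points are broken alphabetically.
    return sorted(candidates, key=lambda c: (-points[c], c))
```

Because the judges' rankings are fixed before the rubric generator is trained, the reward target stays independent of the generator's own outputs, which is the property the circularity check relies on.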