Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
arXiv preprint arXiv:2602.03619 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.AI 2years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.
citing papers explorer
-
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
-
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preserving OOD performance.