Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Bowen Zhang; Da Zhu; Guanjun Jiang; Jiangwang Chen; Jiazheng Kang; Xiao Yang; Zixin Song

arxiv: 2605.23590 · v1 · pith:TCMR2K5Vnew · submitted 2026-05-22 · 💻 cs.AI

Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

Pith reviewed 2026-05-25 04:16 UTC · model grok-4.3

classification 💻 cs.AI

keywords ReAct agents · step-level rubrics · GRPO training · list-wise ranking · search agents · multi-step reasoning · DeepResearchBench · SQA-CS-V2

The pith

Co-ReAct injects trained rubrics into ReAct agents at each decision step to guide evidence seeking, reasoning, and action selection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ReAct-style agents decide next steps using only internal judgment, which often yields shallow or redundant trajectories on search-heavy tasks. Co-ReAct supplies external rubrics that specify concrete targets for the next Reason-or-Act choice. These rubrics are produced by a generator trained with GRPO to maximize list-wise Spearman rank correlation against multi-judge expert rankings rather than binary preferences. The resulting guidance yields consistent gains over ReAct and test-time compute baselines on DeepResearchBench and SQA-CS-V2 for both 8B/14B open models and frontier closed models. The same generator can be dropped into other agent frameworks without altering their internal decision loops.
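To make the injection idea concrete, here is a minimal sketch of a ReAct-style loop that asks for a fresh rubric before every decision. Everything in it is an assumption made for illustration: the callables `generate_rubric`, `agent_decide`, and `execute`, the prompt wording, and the `Finish` stop marker are hypothetical and are not taken from the paper or its repository.

```python
from typing import Callable, List


def co_react_step_loop(
    question: str,
    generate_rubric: Callable[[str, List[str]], str],  # hypothetical rubric generator
    agent_decide: Callable[[str], str],                # hypothetical policy: prompt -> next step
    execute: Callable[[str], str],                     # hypothetical tool executor: step -> observation
    max_steps: int = 8,
) -> List[str]:
    """Run a ReAct-style loop, adding a fresh rubric to the prompt at every step."""
    history: List[str] = []
    for _ in range(max_steps):
        # The rubric is produced outside the agent and only added to its context;
        # the agent's own decision call is left unchanged.
        rubric = generate_rubric(question, history)
        prompt = "\n".join(
            [
                f"Question: {question}",
                *history,
                f"Rubric for the next step: {rubric}",  # assumed placement: after the history
                "Next step:",
            ]
        )
        step = agent_decide(prompt)
        history.append(step)
        if step.startswith("Finish"):  # assumed stop convention
            break
        history.append(f"Observation: {execute(step)}")
    return history
```

Because the rubric only enters through the prompt string, the same wrapper could in principle sit around a different agent policy, which is the sense in which the summary calls the generator a drop-in component.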

Core claim

Co-ReAct treats rubrics as active step-level collaborators that are injected into the agent's context to direct the next decision, with the generator optimized via GRPO on a list-wise Spearman reward that aligns with expert consensus rankings; this produces more discriminative guidance than prior evaluative uses of rubrics and transfers across tasks and base models.
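For readers unfamiliar with GRPO, the part that matters here is that each sampled rubric is scored relative to the other rubrics sampled for the same prompt. The sketch below shows that group-relative normalization only. It assumes population standard deviation and a small `eps` for stability; the paper's exact variant (clipping, KL penalty, group size) is not given in the text reproduced on this page.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four rubrics sampled for one prompt, each with a rank-correlation reward.
advantages = group_relative_advantages([0.9, 0.4, 0.4, -0.1])
```

A group in which every rubric earns the same reward yields all-zero advantages, so such a group contributes no learning signal under this normalization.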

What carries the argument

A rubric generator trained with GRPO on a list-wise Spearman rank-correlation reward against multi-judge consensus rankings. Its rubrics are injected at each agent step to specify targets for evidence seeking, search, reasoning, or self-evaluation.
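A list-wise Spearman reward can be sketched in a few lines. The version below assumes that a rubric is used to score a list of candidate responses and that those scores are compared with consensus scores for the same candidates; ties receive average ranks and a constant list yields a reward of 0. The function names and these conventions are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks (smallest value gets rank 1), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of equal values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_reward(rubric_scores: Sequence[float], consensus_scores: Sequence[float]) -> float:
    """Spearman rho computed as the Pearson correlation of the two rank vectors."""
    if len(rubric_scores) != len(consensus_scores):
        raise ValueError("score lists must have equal length")
    n = len(rubric_scores)
    if n < 2:
        return 0.0  # assumed convention: no ranking signal in a list of one
    rx, ry = average_ranks(rubric_scores), average_ranks(consensus_scores)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention: a constant list carries no ranking signal
    return cov / (vx * vy) ** 0.5
```

Unlike a pairwise preference signal, this reward looks at the whole ordering at once: a rubric that swaps the top two candidates of a long list is penalized less than one that reverses the list.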

If this is right

Agents generate less redundant and more targeted trajectories on search-intensive tasks.
Performance gains appear across 8B, 14B open-source and frontier closed-source models on DeepResearchBench and SQA-CS-V2.
The rubric generator improves other test-time compute baselines as a drop-in module without altering their decision mechanisms.
Rubrics move from post-hoc evaluation to real-time action guidance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same list-wise ranking objective could be applied to generate guidance signals for agent frameworks other than ReAct.
Step-level rubric injection may reduce the need for extensive prompt engineering or additional test-time sampling.
Dynamic regeneration of rubrics mid-trajectory based on observed progress could further tighten guidance.
The approach separates the production of quality signals from the agent's policy, allowing independent scaling of the rubric model.

Load-bearing premise

The trained rubric generator produces reliable, discriminative step-level guidance that transfers to new tasks and base models without requiring changes to the agent's core decision loop.

What would settle it

Running the trained rubric generator on a new multi-step search benchmark or different base model and finding no improvement over the corresponding ReAct baseline would falsify the transfer claim.
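One way to run such a check is a paired bootstrap over per-question score differences between the rubric-guided agent and its ReAct baseline. The sketch below is a generic statistical recipe rather than the paper's evaluation protocol; the function name, the 95% percentile interval, and the default resample count are assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap_delta(
    baseline: List[float],
    treated: List[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean delta, lower bound, upper bound) of a 95% percentile interval."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    # Pair scores by question so that question difficulty cancels out.
    deltas = [t - b for b, t in zip(baseline, treated)]
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(deltas) / n, lower, upper
```

If the interval for a new benchmark or base model comfortably contains zero, that would count as the kind of null result described above.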

Figures

Figures reproduced from arXiv: 2605.23590 by Bowen Zhang, Da Zhu, Guanjun Jiang, Jiangwang Chen, Jiazheng Kang, Xiao Yang, Zixin Song.

**Figure 2.** DRB RACE sub-metric results with Gemini 3.1 Pro used as the search agent, answer generator, and rubric generator. Co-ReAct achieves the best score on every sub-metric. Dashed lines mark the ReAct baseline in each group. [Accompanying text, truncated at both ends:] … methods (Self-Refine, Best-of-N, CRITIC) fail to improve over ReAct on this strong model, suggesting that self-correction and resampling offer diminishing returns when the base agent is …

**Figure 3.** Plug-in rubric portability. The rubric trained …

**Figure 4.** Illustration of the rubric–verify–retry mechanism on a single SQA-CS-V2 question about DepthCrafter. ReAct and Co-ReAct issue identical … [Figure content, partially recovered: the question "What is DepthCrafter and how does it differ from prior monocular depth estimators?" and a ReAct trajectory (a1, a2, a3) that calls academic_search(query="DepthCrafter video depth", year="2024", limit=8), observing 3 arXiv hits with DepthCrafter on top, then google_search(query="DepthCrafter paper CVPR 2025"), observing: project…]

Original abstract

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Co-ReAct adds step-level rubric injection at inference time via a GRPO-trained generator using list-wise Spearman correlation to expert rankings, with reported gains on two search benchmarks across model sizes.

The core idea is straightforward: instead of letting ReAct agents rely only on their own judgment for next steps in search-heavy tasks, Co-ReAct inserts a trained rubric at each decision point to steer evidence seeking, reasoning, or stopping. The rubric generator is optimized with GRPO on a list-wise Spearman rank-correlation reward against multi-judge expert consensus, which is meant to favor discriminative rubrics over merely plausible ones. This setup is positioned as inference-time and action-guiding rather than post-hoc evaluation or training reward only. The same generator is shown as a drop-in that improves other baselines without touching their decision loops. Evaluation covers DeepResearchBench and SQA-CS-V2, open 8B/14B models plus closed frontier ones, and code is released publicly. That combination of cross-model testing and reproducibility is the part that stands out as useful. The training objective ties directly to producing rankings that match expert consensus, which avoids some circularity issues in preference tuning.

Soft spots are mostly around missing details in the summary: no numbers, error bars, or ablation on the Spearman component versus simpler objectives appear here, so the size of the lift and how much the new objective contributes remain unclear. Transfer of the generator to new tasks or models is asserted but would need checks on how the expert data was collected and whether domain shift hurts the rubrics. The central claim holds together without obvious internal contradictions or unstated assumptions that would break the reported gains.

This is for people working on agent trajectories in multi-step search and reasoning. Readers who want concrete inference-time techniques with public code will find something to try. It deserves a serious referee to examine the full experiments, training data, and statistical details.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Co-ReAct, a rubric-guided framework for ReAct-style agents on search-intensive tasks. At each step the agent receives a rubric (produced by a GRPO-trained generator) that supplies step-level targets for evidence seeking, reasoning, or self-evaluation. The generator is optimized with a list-wise Spearman rank-correlation objective against multi-judge expert consensus rankings rather than pairwise preferences. The paper reports consistent gains over ReAct and test-time compute baselines on DeepResearchBench and SQA-CS-V2 for both 8B/14B open-source and frontier closed-source models; the trained generator is also shown to improve other baselines as a drop-in module. Public code is released.

Significance. If the reported gains hold under the stated evaluation protocol, the work supplies a practical, model-agnostic mechanism for injecting step-level guidance into existing agent loops without retraining the base policy. Strengths include the explicit list-wise ranking training objective, cross-model (open and closed) evaluation, and public code release, all of which facilitate reproducibility and extension.

major comments (2)

[Results] Results section (and any accompanying tables/figures): the abstract asserts 'consistent improvements' across two benchmarks and multiple model families, yet the supplied text contains no numerical deltas, standard deviations, or dataset statistics. Without these quantities it is impossible to judge effect size or statistical reliability of the central empirical claim.
[§3.2] §3.2 (rubric injection): the description states that the rubric is 'injected into the agent's context' but does not specify the exact prompt template, placement relative to the ReAct history, or token budget. This detail is load-bearing for the claim that the generator functions as a drop-in component without altering core decision loops.

minor comments (2)

[Method] The training objective is described as list-wise Spearman correlation; a brief equation or pseudocode would clarify how ties and list length are handled.
[Figures/Tables] Figure captions and table headers should explicitly state the number of runs and whether error bars represent standard deviation or standard error.
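The distinction the second minor comment draws is easy to state in code. A small sketch, assuming per-run benchmark scores are available as a plain list; the function name and dictionary keys are illustrative.

```python
import math
import statistics
from typing import Dict, List


def summarize_runs(scores: List[float]) -> Dict[str, float]:
    """Report run count, mean, sample standard deviation, and standard error of the mean."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    sem = std / math.sqrt(n)        # standard error shrinks as the number of runs grows
    return {"n": float(n), "mean": mean, "std": std, "sem": sem}
```

The two quantities differ by a factor of the square root of the run count, so an error bar is hard to interpret unless the caption says which one it shows and how many runs it covers.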

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below.

Referee: [Results] Results section (and any accompanying tables/figures): the abstract asserts 'consistent improvements' across two benchmarks and multiple model families, yet the supplied text contains no numerical deltas, standard deviations, or dataset statistics. Without these quantities it is impossible to judge effect size or statistical reliability of the central empirical claim.

Authors: We agree that explicit numerical deltas, standard deviations, and dataset statistics are necessary to fully substantiate the claims of consistent improvements. Although the results section contains performance tables, we will revise it to include these additional details (effect sizes, variability measures, and dataset statistics) for improved clarity and statistical transparency. revision: yes

Referee: [§3.2] §3.2 (rubric injection): the description states that the rubric is 'injected into the agent's context' but does not specify the exact prompt template, placement relative to the ReAct history, or token budget. This detail is load-bearing for the claim that the generator functions as a drop-in component without altering core decision loops.

Authors: We acknowledge that additional implementation details are required for reproducibility. In the revised manuscript we will expand §3.2 to include the exact prompt template, the precise placement of the rubric relative to the ReAct history, and the token budget used for injection. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an empirical method for training a rubric generator via GRPO, with the reward explicitly defined as list-wise Spearman rank-correlation against external multi-judge expert consensus rankings. No equations, first-principles derivations, or predictions appear that reduce by construction to the paper's own fitted values or self-citations. The central claims rest on benchmark improvements (DeepResearchBench, SQA-CS-V2) across multiple base models, with the training objective tied to independent external data rather than internal self-reference. This is a standard empirical RL setup with no load-bearing self-citation chains or ansatz smuggling.
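The text on this page does not say how the judges' individual rankings are merged into a consensus. Purely as an illustration, the sketch below assumes a Borda count, one common aggregation rule; the function name, the best-first input format, and the alphabetical tie-break are all assumptions.

```python
from typing import Dict, List


def borda_consensus(judge_rankings: List[List[str]]) -> List[str]:
    """Merge best-first rankings from several judges into one best-first consensus."""
    points: Dict[str, int] = {}
    for ranking in judge_rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # The top candidate earns n - 1 points, the last earns 0.
            points[candidate] = points.get(candidate, 0) + (n - 1 - position)
    # Higher total first; ties broken alphabetically (an arbitrary assumption).
    return sorted(points, key=lambda c: (-points[c], c))


consensus = borda_consensus([["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]])
```

Whatever rule is actually used, the point made above is that the target ranking comes from judges external to the rubric generator, which is why the reward does not reduce to the model's own outputs.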

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or implementation details, so no free parameters, axioms, or invented entities can be identified.

