ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

Jingsen Zhang; Rui Li; Xiaohe Bo; Xu Chen; Yuanzi Li; Zihang Tian

arxiv: 2606.21262 · v1 · pith:ZROJB42Xnew · submitted 2026-06-19 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-26 14:32 UTC · model grok-4.3

keywords: adaptive rubric · co-evolution · LLM agents · multi-step reasoning · credit assignment · reward modeling · HotpotQA · trajectory decomposition

The pith

ARCO co-evolves a shared-backbone model that generates per-step rubrics and scores them so the policy can receive step-level rewards without step labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ARCO to solve the problem that scalar terminal rewards give no guidance on which steps in a multi-step LLM trajectory were responsible for success or failure. It trains a single model with a generation head that writes natural-language criteria for each step and a score head that assigns rewards to those criteria, while a decomposition constraint forces the sum of those step rewards to match the known terminal outcome. The rubric model and the agent policy are then updated together on fresh on-policy trajectories, allowing both the content of the rubrics and the way they are scored to adapt as the agent improves. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue show higher exact-match scores than outcome, rubric, and process baselines for two different open-source backbones.

Core claim

ARCO is a rubric framework in which a same-scale model μ shares a backbone with two heads: a generation head that produces per-step criteria, and a score head that predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step-level labels, while μ and the policy π are jointly updated on on-policy data so that the rubric content and the scoring function co-evolve at the parameter level.

What carries the argument

The joint training of the generation and score heads inside model μ with the agent policy π under the trajectory decomposition constraint that forces step rewards to sum to the terminal outcome.
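As a rough illustration of the constraint, and not the paper's actual loss, the sketch below penalizes the squared gap between the summed step rewards and the terminal outcome. The function name `decomposition_penalty`, the squared-error form, and the example numbers are assumptions made for illustration; the summary above does not state how the constraint is enforced.

```python
from typing import Sequence


def decomposition_penalty(step_rewards: Sequence[float], terminal_outcome: float) -> float:
    """Squared gap between the summed step rewards and the terminal outcome.

    Hypothetical form: the source only says the sum of step rewards is tied
    to the terminal outcome, not which penalty enforces it.
    """
    residual = sum(step_rewards) - terminal_outcome
    return residual ** 2


# Illustrative trajectory: three steps, terminal outcome 1.0 (e.g. exact match).
consistent = [0.25, 0.25, 0.5]    # sums to 1.0, so the penalty is zero
inconsistent = [0.5, 0.5, 0.5]    # sums to 1.5, so the penalty is 0.25

print(decomposition_penalty(consistent, 1.0))
print(decomposition_penalty(inconsistent, 1.0))
```

A penalty of this kind only sees the sum, which is why the way credit is split across steps has to come from somewhere else in the training signal.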

If this is right

Exact-match scores rise above the best outcome-reward, rubric-reward, and process-reward baselines on all three multi-hop QA datasets for both backbones tested.
The generated rubrics are step-specific rather than generic trajectory-level statements.
The rubrics remain effective across different design choices for the generation prompt.
The rubrics can be inspected to diagnose which steps the agent is handling well or poorly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same co-evolution pattern could be applied to agent tasks whose terminal signal is cheaper to obtain than step-by-step supervision, such as tool-use or web navigation.
Because the rubric model stays at the same scale as the policy, the method may scale more readily than approaches that rely on a larger frozen judge.
The step-level scores produced by the score head could serve as an auxiliary training signal for future process-supervised fine-tuning runs.

Load-bearing premise

The constraint that the sum of the predicted step rewards must equal the terminal outcome is enough to produce useful step-level credit assignment even though no step labels are available.

What would settle it

Ablating the trajectory decomposition constraint while keeping every other component of ARCO fixed and checking whether the exact-match gains on HotpotQA, 2WikiMultiHopQA, and MuSiQue disappear.

Figures

Figures reproduced from arXiv: 2606.21262 by Jingsen Zhang, Rui Li, Xiaohe Bo, Xu Chen, Yuanzi Li, Zihang Tian.

**Figure 2.** Three single-step score comparisons on MuSiQue dev. In each panel, ARCO and one …

**Figure 3.** RQ4 K-sensitivity diagnostics on HotpotQA / Qwen for the sweep on fixed policy-warmup trajectories (K ∈ {1, 3, 5, 7, 9}). (A) Dev EM/F1 at each best-dev-EM checkpoint. (B) Distinct semantic themes per step (blue, left) and distinct-theme ratio #themes/K (red, right). (C) Step-level total semantic duplicate rate (orange, left) and score range (purple, right; max–min scores within same-step duplicate semanti…

**Figure 4.** Three K=7 HotpotQA / Qwen dev steps with different redundancy patterns. Colored dots show the theme of each criterion. Exact duplicate: 7 byte-identical criteria. Semantic duplicate: 7 different wordings all scoring the same theme. Diverse: 7 criteria span 5 distinct themes.

**Figure 5.** A HotpotQA dev trajectory under the four-way binding protocol (Qwen3-4B, ARCO epoch …

**Figure 6.** Two RQ3 rubric-binding failures. In Case 1, the gold action is …

**Figure 7.** Policy prompt for the HotpotQA agent. The forced-finish variant is used when the search …

**Figure 8.** Shared rubric system prompt that defines the evaluator role and output constraints.

**Figure 9.** Step-level rubric prompt that generates local criteria for an individual policy action.

**Figure 10.** Trajectory-level rubric prompt used during warmup annotation to produce criteria and …

**Figure 11.** RQ3 four-way step-binding judge prompt (…

**Figure 12.** RQ3 step-specificity judge prompt (Spec). The judge sees the question, trajectory prefix, gold action, and the rubric, and rates how specifically the rubric evaluates the gold action on a 1–5 scale.

Original abstract

Reinforcement learning for multi-step LLM agents often relies on scalar rewards that indicate success but cannot explain why a trajectory is good or bad. Rubric-based rewards improve interpretability through natural-language criteria, but existing methods score at the trajectory level and freeze the scorer behind a closed-source judge, leaving step-level credit assignment unresolved and the judge itself static. We propose ARCO (Adaptive Rubric CO-evolution), a rubric framework in which a same-scale model $\mu$ shares a backbone with two heads: a generation head that produces per-step criteria, and a score head that predicts rubric-conditioned step-level rewards. A trajectory decomposition constraint ties the sum of step rewards to the terminal outcome, enabling credit assignment without step-level labels, while $\mu$ and the policy $\pi$ are jointly updated on on-policy data so that the rubric content and the scoring function co-evolve at the parameter level. Across HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open-source backbones, ARCO improves the best EM in every setting over strong outcome-, rubric-, and process-reward baselines, and analyses show that its rubrics are step-specific, robust to design choices, and useful for diagnosing agent behavior. Codes and data are available at https://github.com/zihangtian/ARCO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ARCO introduces joint on-policy co-evolution of a shared model's rubric-generation and scoring heads plus a sum-to-outcome constraint for step rewards, but the constraint risks permitting uninformative per-step scores.

ARCO's core move is to put rubric generation and step scoring inside one same-scale model with separate heads, then update both heads together with the policy on on-policy trajectories while enforcing that the sum of step scores equals the terminal outcome. This lets them train without step labels and keeps the rubric content from freezing.

The architecture is the clearest novelty: generation and scoring share the backbone and evolve at the parameter level rather than relying on a static external judge. They report exact-match gains over outcome-reward, rubric, and process-reward baselines on HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open backbones, and they release code and data. That combination of framing and reproducibility is what makes the work worth looking at.

The soft spot is the decomposition constraint itself. It only requires the step scores to add up to the terminal signal; nothing in that equation prevents uniform or front-loaded scores that carry no step-specific information. The abstract claims that the joint updates produce step-specific rubrics and that the analyses support this, but the concern stands until the loss details, including any variance or contrastive terms, are checked. If those terms are weak or absent, the EM improvements could come from better trajectory filtering rather than genuine credit assignment.
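The concern can be made concrete with a small numeric check. In the sketch below (illustrative numbers and a hypothetical helper name, `constraint_residual`), three very different step-score assignments for a four-step successful trajectory all satisfy the sum-to-outcome constraint exactly, so the constraint alone cannot tell them apart.

```python
def constraint_residual(step_scores, terminal_outcome):
    """Signed gap between the summed step scores and the terminal outcome."""
    return sum(step_scores) - terminal_outcome


terminal_outcome = 1.0  # assumed: a successful trajectory scored 1.0

assignments = {
    "uniform":      [0.25, 0.25, 0.25, 0.25],  # every step gets equal credit
    "front_loaded": [1.0, 0.0, 0.0, 0.0],      # all credit on the first step
    "step_aware":   [0.5, 0.0, 0.125, 0.375],  # credit varies by step
}

# Each assignment leaves a residual of zero under the constraint.
for name, scores in assignments.items():
    print(name, constraint_residual(scores, terminal_outcome))
```

Whatever separates the third assignment from the first two has to come from the rubric-conditioned scoring and the policy updates, not from the sum constraint.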

This paper is for people already working on process rewards or rubric methods for agents. A reader who wants to see whether the constraint actually delivers usable step signals will get value from the experiments and code. The thinking is coherent on its own terms even if the central assumption needs more evidence.

I would send it to peer review so the loss formulation, ablations on the constraint, and the step-specificity analyses can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes ARCO, a framework for multi-step LLM agents in which a same-scale model μ shares a backbone between a generation head (producing per-step natural-language rubrics) and a score head (producing rubric-conditioned step rewards). A trajectory decomposition constraint enforces that the sum of step-level rewards equals the terminal outcome reward, enabling credit assignment without step labels; μ and the policy π are jointly trained on on-policy trajectories so that rubric generation and scoring co-evolve. The authors report that ARCO raises exact-match scores over outcome-, rubric-, and process-reward baselines on HotpotQA, 2WikiMultiHopQA, and MuSiQue with two open-source backbones, and provide analyses indicating that the learned rubrics are step-specific and diagnostically useful. Code and data are released.

Significance. If the central empirical claims hold, the work supplies a practical route to interpretable, step-level credit assignment for LLM agents without requiring step-level human labels. The joint co-evolution of rubric content and scoring function at the parameter level, together with the open release of code and data, would constitute a reproducible contribution to the literature on process supervision and rubric-based RL.

major comments (2)

[Training Objective / §3] The trajectory decomposition constraint (abstract and §3) that the sum of per-step rewards equals the terminal outcome is under-constrained: any assignment of step rewards whose sum matches the outcome satisfies the loss, including constant or front-loaded rewards that carry no information about individual step quality. The description does not mention auxiliary terms (entropy, variance, or contrastive penalties) that would force differentiation across steps; without such terms the observed EM gains could arise from improved trajectory-level selection rather than genuine step-level credit assignment.
[Experiments / §4] The experimental section reports that ARCO improves the best EM in every setting, yet provides no ablation that isolates the decomposition constraint from the co-evolution mechanism, no statistical significance tests across runs, and no verification that the learned step rewards are non-uniform or predictive of step quality. These omissions leave the load-bearing claim—that the constraint produces useful step-level credit assignment—unverified.
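One way to run the non-uniformity check raised in the second comment is to measure how much step rewards vary within each trajectory. The sketch below is an assumption about how such a diagnostic could look; `near_uniform_fraction`, the tolerance, and the sample rewards are all hypothetical and do not come from the paper or its code release.

```python
from statistics import pstdev


def near_uniform_fraction(trajectories, tol=1e-3):
    """Fraction of multi-step trajectories whose step rewards are nearly constant.

    `trajectories` is a list of per-trajectory step-reward lists. A trajectory
    counts as near-uniform when the population standard deviation of its step
    rewards is at most `tol`. Single-step trajectories are skipped because
    they cannot show within-trajectory variation.
    """
    multi_step = [t for t in trajectories if len(t) > 1]
    if not multi_step:
        return 0.0
    flat = sum(1 for t in multi_step if pstdev(t) <= tol)
    return flat / len(multi_step)


# Hypothetical step rewards from a score head (not real ARCO outputs).
rewards = [
    [0.25, 0.25, 0.25, 0.25],  # uniform
    [0.5, 0.0, 0.5],           # differentiated
    [0.0, 0.0],                # uniform (failed trajectory)
    [1.0],                     # single step, skipped
]
print(near_uniform_fraction(rewards))
```

A value close to 1 on real rollouts would suggest the score head has settled on the uniform solution; a low value shows variation but does not by itself show that the variation tracks step quality.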

minor comments (2)

[Method] Notation for the shared backbone and the two heads (generation vs. score) is introduced only in the abstract; a diagram or explicit parameter-sharing equation in §3 would clarify the architecture.
[Analysis] The abstract states that rubrics are “step-specific,” but the precise metric or qualitative protocol used to establish specificity is not summarized; a short table or example in the main text would strengthen the claim.
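For illustration only, one way the parameter-sharing statement requested in the [Method] comment could be written is shown below. The symbols $f_\theta$, $g_\phi$, $s_\psi$, $h_t$, $c_t$, $r_t$, $q$, $\tau_{\le t}$, and $R(\tau)$ are assumed notation, not taken from the paper; only the overall structure (one backbone, two heads, a sum constraint) comes from the abstract.

```latex
\begin{aligned}
h_t &= f_\theta(q, \tau_{\le t}) && \text{shared backbone of } \mu \\
c_t &\sim g_\phi(\cdot \mid h_t) && \text{generation head: per-step criteria} \\
r_t &= s_\psi(h_t, c_t) && \text{score head: rubric-conditioned step reward} \\
\sum_{t=1}^{T} r_t &= R(\tau) && \text{trajectory decomposition constraint}
\end{aligned}
```

Written this way, the last line fixes one linear combination of the $T$ step rewards and leaves the remaining $T-1$ degrees of freedom to the score head.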

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and insightful comments on the trajectory decomposition constraint and experimental validation. We address each point below, clarifying the role of per-step rubric generation in promoting differentiation and noting where the manuscript can be strengthened.

Referee: [Training Objective / §3] The trajectory decomposition constraint (abstract and §3) that the sum of per-step rewards equals the terminal outcome is under-constrained: any assignment of step rewards whose sum matches the outcome satisfies the loss, including constant or front-loaded rewards that carry no information about individual step quality. The description does not mention auxiliary terms (entropy, variance, or contrastive penalties) that would force differentiation across steps; without such terms the observed EM gains could arise from improved trajectory-level selection rather than genuine step-level credit assignment.

Authors: The constraint is mathematically satisfied by any step-reward assignment summing to the outcome. However, because the generation head produces distinct natural-language rubrics for each step and the score head conditions directly on those rubrics, the learned scoring function must differentiate across steps to produce rubric-specific rewards; constant or front-loaded assignments would be inconsistent with the varying rubric content. Joint on-policy training of μ and π further selects for assignments that yield policy improvement, as only useful step-level signals allow the agent to correct individual steps. The reported analyses already show the rubrics are step-specific and diagnostically useful, supporting that differentiation occurs in practice. We did not add explicit auxiliary losses, but will add a paragraph in §3 clarifying why the rubric-conditioned architecture mitigates under-constraint.
Revision: partial
Referee: [Experiments / §4] The experimental section reports that ARCO improves the best EM in every setting, yet provides no ablation that isolates the decomposition constraint from the co-evolution mechanism, no statistical significance tests across runs, and no verification that the learned step rewards are non-uniform or predictive of step quality. These omissions leave the load-bearing claim—that the constraint produces useful step-level credit assignment—unverified.

Authors: We acknowledge the absence of an explicit ablation separating the decomposition constraint from co-evolution and the lack of statistical significance tests across independent runs. The existing analyses demonstrate that rubrics are step-specific and useful for diagnosis, which provides indirect evidence that the resulting rewards are non-uniform. Direct verification of reward non-uniformity and predictive power for step quality, together with an ablation and significance tests, would strengthen the central claim. We will add these elements in the revised manuscript.
Revision: yes
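As a sketch of the kind of significance check discussed in this exchange, a paired bootstrap over per-question exact-match outcomes is one common choice. Everything below is an assumption for illustration: the function name `paired_bootstrap_win_rate`, the resample count, and the toy 0/1 outcomes are not from the paper. It measures variability across questions, not across training seeds, which would require repeated runs.

```python
import random


def paired_bootstrap_win_rate(system_em, baseline_em, n_resamples=2000, seed=0):
    """Share of bootstrap resamples in which the system's total EM beats the baseline's.

    Both inputs are per-question 0/1 exact-match outcomes on the same questions,
    in the same order. Questions are resampled with replacement, keeping the
    system/baseline pairs aligned.
    """
    if len(system_em) != len(baseline_em):
        raise ValueError("inputs must cover the same questions")
    rng = random.Random(seed)
    n = len(system_em)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(system_em[i] - baseline_em[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples


# Toy per-question outcomes (hypothetical, not reported results).
system = [1, 1, 0, 1, 1, 0, 1, 1]
baseline = [1, 0, 0, 1, 0, 0, 1, 0]
print(paired_bootstrap_win_rate(system, baseline))
```

A win rate near 1 indicates that the improvement is stable under resampling of the evaluation questions; a rate near 0.5 indicates that it is not.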

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks remain independent of training constraint

The paper's central claims are empirical EM improvements on HotpotQA, 2WikiMultiHopQA, and MuSiQue using open-source backbones, measured against outcome-, rubric-, and process-reward baselines. The trajectory decomposition constraint (sum of step rewards equals terminal outcome) is a training mechanism for the score head without step-level labels, but this does not reduce the reported benchmark gains to the inputs by construction. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The co-evolution of rubric and scoring heads on on-policy data is a joint optimization procedure whose outputs are externally validated on held-out tasks. This is the common case of a self-contained empirical method against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the trajectory decomposition constraint and the effectiveness of joint parameter-level co-evolution; both are introduced in the paper rather than derived from prior literature.

axioms (1)

Domain assumption: The sum of step-level rewards equals the terminal outcome reward.
Invoked to enable credit assignment without step-level labels.

invented entities (1)

Model μ with generation head and score head sharing the same backbone (no independent evidence)
Purpose: Produce per-step criteria and predict rubric-conditioned step rewards.
New model component introduced to implement adaptive rubrics.
