From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

Chao Wang; Jiacheng Huang; Ranxu zhang; Rui Zhang; Sun Zhe; Xiaozhou Xu; Yanyong Zhang; Zeyang Li

arxiv: 2605.23382 · v1 · pith:6ET367O7new · submitted 2026-05-22 · 💻 cs.CL

Pith reviewed 2026-05-25 04:45 UTC · model grok-4.3

classification 💻 cs.CL

keywords personalized reinforcement learning · agentic RL · reward decoupling · preference disentanglement · skill graph memory · user-conditioned agents · anchor-based optimization

The pith

A unified framework embeds user preferences into agentic RL training by decoupling them from generic task rewards and stabilizing via user-specific anchors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that personalization must be built into the optimization process itself rather than applied after generic training, because standard rewards ignore differences in how users want tasks solved and observed actions mix true preferences with conformity to common patterns. This matters for real agent applications where the same query should trigger different planning or tool choices depending on the user. The method introduces a reward-decoupled policy optimizer with user anchors, a two-stage model to separate preferences from conformity, and a graph memory that evolves and retrieves skills aligned to each user's history. These components close a loop of preference identification, policy updates, and skill accumulation, leading to agents that outperform baselines on the reported tasks.

Core claim

The central claim is that personalization should be embedded directly into training-time optimization. Personalized Anchor Reward-Decoupled Policy Optimization (PARPO) lets agents learn from separate generic task-quality rewards and user-specific preference rewards, with user anchors stabilizing updates when reward scales differ across users. A two-stage preference-disentangled reward model extracts clean preference signals, and a Preference-Aligned Skill Evolution Graph Memory (PSGM) supports retrieval of matching skills. Together these form a closed loop that produces user-conditioned behavior.

What carries the argument

Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which separates generic task rewards from personalized preference rewards and applies user-specific anchors for stable updates under varying reward magnitudes.
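
To make the decoupling idea concrete, here is a minimal sketch under stated assumptions. The abstract gives no equations, so the function names, the additive combination, and the per-user `anchor_mean` / `anchor_scale` statistics are all hypothetical illustrations, not the paper's PARPO objective. The sketch normalizes the task reward within a rollout group and calibrates the preference reward against a per-user anchor, then contrasts that with pooling preference rewards across users.

```python
from statistics import mean, pstdev

def group_normalize(values, eps=1e-6):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def anchored_advantages(task_rewards, pref_rewards, anchor_mean, anchor_scale,
                        pref_weight=1.0, eps=1e-6):
    """Combine a group-normalized task term with a user-anchored preference term.

    anchor_mean and anchor_scale are hypothetical per-user statistics (for example,
    running estimates of that user's preference reward). They are an assumption of
    this sketch, not a definition taken from the paper.
    """
    task_adv = group_normalize(task_rewards, eps)
    pref_adv = [(r - anchor_mean) / (anchor_scale + eps) for r in pref_rewards]
    return [t + pref_weight * p for t, p in zip(task_adv, pref_adv)]

# Two users with the same ordering of preferences but reward scales 10x apart.
task = [1.0, 0.0, 1.0, 0.0]
prefs_a = [0.9, 0.8, 0.2, 0.1]
prefs_b = [9.0, 8.0, 2.0, 1.0]

user_a = anchored_advantages(task, prefs_a, anchor_mean=0.5, anchor_scale=0.4)
user_b = anchored_advantages(task, prefs_b, anchor_mean=5.0, anchor_scale=4.0)

# Contrast: pooling both users into one normalization group pushes every
# sample from the small-scale user below the shared mean.
pooled = group_normalize(prefs_a + prefs_b)
```

With per-user anchors the two users receive nearly identical advantages despite the scale gap, while the pooled version assigns a negative preference signal to every sample from the small-scale user.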

If this is right

Agents can adapt planning strategies and tool selections to individual users without generic correctness signals overriding personal preferences.
The two-stage reward model produces supervision signals that isolate user preferences from conformity in training data.
Graph memory structures allow retrieval of previously learned skills that match a given user's preference profile.
The closed loop of identification, optimization, and skill evolution supports iterative improvement as more user-specific data arrives.
The approach yields higher task success under personalized evaluation than standard memory or RL methods on ETAPP, ETAPP-Hard, and SJAgent.
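
The graph-memory retrieval mentioned above can be sketched as follows. This is an illustration only: the `Skill` record, the `pref_vector` embedding, the `refines` edges, and cosine ranking are assumptions made for the example, since the abstract does not describe how PSGM stores or scores skills.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    pref_vector: list                             # hypothetical embedding of the preferences this skill serves
    refines: list = field(default_factory=list)   # names of skills this one evolved from

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(graph, user_profile, k=1):
    """Return the k skills best aligned with the user, followed by the skills they refine."""
    ranked = sorted(graph.values(),
                    key=lambda s: cosine(s.pref_vector, user_profile),
                    reverse=True)
    selected = []
    for skill in ranked[:k]:
        if skill.name not in selected:
            selected.append(skill.name)
        # Follow evolution edges so the generic parent skill is available as a fallback.
        selected.extend(p for p in skill.refines if p not in selected)
    return selected

# Toy graph: one generic skill and two preference-specific refinements.
# Vector axes are (price sensitivity, time sensitivity), chosen for illustration.
graph = {
    "book_flight": Skill("book_flight", [0.5, 0.5]),
    "book_cheapest_flight": Skill("book_cheapest_flight", [1.0, 0.0], refines=["book_flight"]),
    "book_fastest_flight": Skill("book_fastest_flight", [0.0, 1.0], refines=["book_flight"]),
}
budget_user = [0.9, 0.1]
result = retrieve(graph, budget_user)
```

A flat memory would rank the three skills by query similarity alone; the point of the sketch is that conditioning the ranking on a user profile changes which refinement is returned for the same query.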

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decoupling might allow agents to handle multiple users in one session by switching anchors without retraining the base policy.
If the graph memory grows with real interactions, it could reduce reliance on large prompt-based memories for long-horizon personalization.
Extending the anchor mechanism to continuous user traits rather than discrete IDs could support generalization to unseen users.
Deployment in live systems would test whether the disentanglement remains stable when user feedback contains noise or strategic behavior.

Load-bearing premise

Observed user behaviors contain disentangleable preference signals that a two-stage reward model can separate from conformity effects, enabling stable optimization despite differing reward scales across users.
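
One simple way to picture this premise is a residual view: treat population-level choice rates as a proxy for conformity, then read a user's deviation from those rates as a preference signal. The sketch below is a toy stand-in under that assumption; the paper's two-stage reward model is a learned model whose details the abstract does not give, and the log format and function names here are hypothetical.

```python
from collections import Counter

def choice_rates(choices):
    """Fraction of times each action appears in a list of observed choices."""
    counts = Counter(choices)
    total = len(choices)
    return {action: n / total for action, n in counts.items()}

def disentangle(logs, user):
    """Stage 1: population rates as a conformity proxy. Stage 2: per-user residual."""
    population = choice_rates([a for actions in logs.values() for a in actions])
    own = choice_rates(logs[user])
    actions = set(population) | set(own)
    return {a: own.get(a, 0.0) - population.get(a, 0.0) for a in actions}

# Hypothetical tool-choice logs for three users.
logs = {
    "u1": ["search", "search", "search", "calendar"],
    "u2": ["search", "calendar", "calendar", "calendar"],
    "u3": ["search", "search", "search", "search"],
}
residual_u2 = disentangle(logs, "u2")
```

Here `search` is the popular choice overall, so a user who picks it often reveals little, while `u2`'s positive residual on `calendar` stands out. If real behavior does not decompose this cleanly, the premise weakens accordingly.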

What would settle it

A controlled test set where preference labels are known but the two-stage model cannot recover them above baseline accuracy, resulting in no performance gain or worse results than non-personalized RL on the same agent tasks.

Figures

Figures reproduced from arXiv: 2605.23382 by Chao Wang, Jiacheng Huang, Ranxu zhang, Rui Zhang, Sun Zhe, Xiaozhou Xu, Yanyong Zhang, Zeyang Li.

**Figure 2.** Overview of the proposed personalized Agentic RL framework.

**Figure 3.** Blinded evaluation on 20 personalized ETAPP tasks. Left: human scores by dimension.

**Figure 4.** Training dynamics and skill evolution analysis of Qwen3-8B on ETAPP. Top: RL training

**Figure 5.** RL training dynamics of Qwen3-8B on ETAPP-Hard, comparing GRPO, GSPO, GiGPO,

**Figure 6.** Personalized reward decomposition of Qwen3-8B on ETAPP-Hard, comparing GRPO,

**Figure 7.** Final EMA scores at the last training step across different reward dimensions on ETAPP

Original abstract

Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper puts forward a concrete framework (PARPO + two-stage disentanglement + PSGM) for handling user-specific preferences in agentic RL, with experiments on ETAPP and SJAgent, but the reported gains rest on an untested disentanglement step whose contribution is not isolated.

The core of this work is a closed-loop setup that separates task-quality rewards from user preferences via anchors in PARPO, runs a two-stage model to pull preferences apart from conformity effects, and stores skills in a preference-aligned graph memory (PSGM). That combination is new as a single package for agentic settings, and the authors supply code and data, which helps anyone who wants to inspect or extend it. The experiments claim steady wins over memory and RL baselines on ETAPP, ETAPP-Hard, and SJAgent, which at least shows the system runs and produces measurable differences on those tasks. The problem itself is real: generic rewards do not fit users with different planning styles, and flat memory does not help retrieval. The architecture tries to fix that directly in the training loop rather than post-hoc.

The main soft spot is that the gains are attributed to the disentanglement and anchoring steps, yet the abstract and reported results give no ablation or control that shows those pieces, rather than other implementation choices, are responsible. Without that isolation, it is hard to know whether the framework's novelty is load-bearing or incidental. The assumption that preferences can be stably separated from conformity also sits at the center; if the full paper lacks checks on that separation under varying reward scales, the optimization could be less stable than claimed.

This is aimed at people already working on user-adaptive agents or personalized RL. A reader who needs a starting architecture for preference-aware tool use or planning could extract useful pieces even if they later modify the reward model. It is coherent enough on its own terms to deserve referee time; the experiments are on concrete benchmarks and the proposal is structured, so reviewers can pressure-test the controls and ablations.

Referee Report

1 major / 1 minor

Summary. The paper proposes a unified framework for personalized agentic reinforcement learning that embeds user preferences into training-time optimization. Its core components are Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which separates generic task-quality rewards from user-specific preference rewards using anchors for stability under heterogeneous scales; a two-stage preference-disentangled reward model to isolate preferences from conformity effects; and Preference-Aligned Skill Evolution Graph Memory (PSGM) for structured, preference-aligned skill retrieval. The framework forms a closed loop of preference identification, policy optimization, and skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent benchmarks report consistent outperformance over strong memory and RL baselines, with code and data provided in supplementary materials.

Significance. If the reported gains hold under rigorous controls, the work would address a practically important limitation in agentic RL by enabling user-conditioned behavior that generic rewards alone cannot capture. The explicit decoupling mechanism, closed-loop design, and release of code/data are strengths that support reproducibility and further testing. The approach could influence personalized agent systems in domains with heterogeneous user preferences, provided the disentanglement step proves robust.

major comments (1)

[Experiments and § on reward model] The central experimental claim (outperformance on ETAPP/ETAPP-Hard/SJAgent) rests on the two-stage preference-disentangled reward model reliably separating user preferences from conformity effects. No section, equation, or table in the provided abstract or methods description reports an ablation isolating the disentanglement stage or a control for reward-scale heterogeneity; without this, it is unclear whether the reported gains are attributable to PARPO/PSGM or to the reward model itself.

minor comments (1)

[Abstract] The abstract states that 'code and data are included in the supplementary materials' but provides no dataset descriptions, error bars, or statistical significance tests for the benchmark results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer isolation of the reward model's contributions. We address the major comment point by point below and commit to revisions that strengthen the experimental section.

Referee: The central experimental claim (outperformance on ETAPP/ETAPP-Hard/SJAgent) rests on the two-stage preference-disentangled reward model reliably separating user preferences from conformity effects. No section, equation, or table in the provided abstract or methods description reports an ablation isolating the disentanglement stage or a control for reward-scale heterogeneity; without this, it is unclear whether the reported gains are attributable to PARPO/PSGM or to the reward model itself.

Authors: We agree that the manuscript as described does not report an explicit ablation isolating the two-stage preference-disentangled reward model or a dedicated control for reward-scale heterogeneity. The current experiments demonstrate overall framework gains but do not decompose the reward model's role from PARPO and PSGM. In revision we will add a dedicated ablation subsection (with new tables) comparing (i) the full model, (ii) PARPO/PSGM with a single-stage reward model, and (iii) variants under uniform vs. anchor-based scaling. This will directly address attribution of the reported improvements.

Revision: yes
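
The promised comparison can be pictured as a small factorial grid. The sketch below is a hypothetical run manifest, assuming the two factors named in the response (reward-model staging and reward scaling) are crossed; the field names and labels are invented for illustration and do not come from the paper or its code release.

```python
from itertools import product

# Factor levels taken from the rebuttal text; the identifiers are hypothetical.
REWARD_MODELS = ("two_stage", "single_stage")
SCALINGS = ("anchor", "uniform")

def ablation_grid():
    """Enumerate every combination of reward model and reward scaling."""
    runs = []
    for reward_model, scaling in product(REWARD_MODELS, SCALINGS):
        runs.append({
            "name": f"{reward_model}+{scaling}",
            "reward_model": reward_model,
            "scaling": scaling,
            # The full framework pairs the two-stage model with anchor-based scaling.
            "is_full_model": reward_model == "two_stage" and scaling == "anchor",
        })
    return runs

grid = ablation_grid()
for run in grid:
    print(run["name"], "(full)" if run["is_full_model"] else "")
```

Crossing the factors, rather than changing one at a time, would let a reader see whether the two-stage model and the anchors contribute independently or only in combination.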

Circularity Check

0 steps flagged

No significant circularity

The abstract and supplied text describe a high-level framework (PARPO, two-stage reward model, PSGM) and report experimental outperformance on ETAPP/ETAPP-Hard/SJAgent without any equations, parameter-fitting steps, self-citations used as load-bearing premises, or derivations that reduce to inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or ansatzes smuggled via prior work are present. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes disentangleable preferences and stable anchor-based optimization but supplies no details.

pith-pipeline@v0.9.0 · 5764 in / 1081 out tokens · 26349 ms · 2026-05-25T04:45:54.187377+00:00 · methodology
