Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

arxiv: 2605.20061 · v1 · pith:KDEMV4O4 · submitted 2026-05-19 · 💻 cs.CL

Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

Pith reviewed 2026-05-20 05:32 UTC · model grok-4.3

classification 💻 cs.CL

keywords reinforcement learning · LLM agents · credit assignment · partial observability · belief states · long-horizon tasks · self-supervised signals · process-level RL

The pith

ReBel rewards belief consistency rather than actions to solve credit assignment in long-horizon LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReBel, a reinforcement learning method that models structured belief states to summarize an agent's interaction history in partially observable environments. It converts mismatches between these predicted beliefs and later observed feedback into dense self-supervised training signals, avoiding the need for external step-by-step labels. The method also groups trajectories that share similar belief states to produce lower-variance advantage estimates for policy updates. Experiments on ALFWorld and WebShop show gains of up to 20.4 percentage points in task success and 2.1 times better sample efficiency compared with episode-level baselines such as GRPO. A sympathetic reader would care because reliable credit assignment remains a core obstacle for scaling LLM agents on realistic, multi-step tasks where rewards arrive late and observations are incomplete.

Core claim

ReBel is a process-level reinforcement learning algorithm that maintains explicit structured belief states to summarize interaction history; it applies belief-consistency supervision by turning discrepancies between predicted beliefs and observed feedback into dense self-supervised signals, and it uses belief-aware grouping to compare trajectories under comparable belief states, producing more robust advantage estimates that improve policy learning on long-horizon tasks under partial observability.
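As a rough illustration of belief-consistency supervision, the sketch below scores a predicted belief against what a later observation reveals. It assumes beliefs are binary vectors and that an observation verifies only some slots. The agreement-fraction rule and the names `consistency_reward`, `predicted`, and `observed` are hypothetical, since this page does not state how the paper quantifies discrepancies.

```python
from typing import Optional, Sequence


def consistency_reward(
    predicted: Sequence[int],
    observed: Sequence[Optional[int]],
) -> float:
    """Fraction of verifiable belief slots that the prediction got right.

    predicted: binary belief vector (assumed form, one 0/1 entry per slot).
    observed:  same length; None marks a slot the observation did not reveal.
    Returns 0.0 when nothing could be verified (an assumed convention).
    """
    if len(predicted) != len(observed):
        raise ValueError("belief and observation must have the same length")
    # Keep only the slots that the later observation actually verifies.
    checked = [(p, o) for p, o in zip(predicted, observed) if o is not None]
    if not checked:
        return 0.0
    matches = sum(1 for p, o in checked if p == o)
    return matches / len(checked)


# Hypothetical slots: "mug is in cabinet", "drawer is open", "sink is empty".
belief = [1, 0, 1]
# The next observation confirms slot 0, contradicts slot 2, and is silent on slot 1.
feedback = [1, None, 0]
step_reward = consistency_reward(belief, feedback)  # 1 of 2 verified slots match
```

A score of this kind is available at every step where an observation verifies part of the belief, which is what makes the signal dense compared with a single terminal reward.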

What carries the argument

Belief-consistency supervision combined with belief-aware grouping, where structured belief states serve as the central summary of history and the source of dense self-supervised signals without external verifiers.

If this is right

Dense self-supervised signals derived from belief consistency can replace or augment sparse episode-level rewards in long-horizon settings.
Grouping trajectories by similar belief states yields lower-variance advantage estimates than grouping by episode identity alone.
The same belief-modeling approach extends credit assignment improvements to other partially observable interactive benchmarks beyond ALFWorld and WebShop.
Process-level supervision of this form increases sample efficiency by roughly twofold on the evaluated tasks.
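The grouping idea in this list can be sketched as follows. The example groups steps whose binary belief vectors match exactly, following the grouping set quoted in the Lean section of this page, and normalizes returns within each group. The mean/standard-deviation normalization is borrowed from GRPO-style estimators and is an assumption here, as are the names `group_advantages`, `beliefs`, and `returns`; the paper's exact estimator is not given on this page.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Belief = Tuple[int, ...]  # binary belief anchor, e.g. (1, 0, 1)
Step = Tuple[int, int]    # (trajectory index i, time step t)


def group_advantages(
    beliefs: Dict[Step, Belief],
    returns: Dict[Step, float],
    eps: float = 1e-8,
) -> Dict[Step, float]:
    """Normalize each step's return against steps that share its belief."""
    # Collect all (i, t) pairs whose belief equals the same anchor.
    groups: Dict[Belief, List[Step]] = defaultdict(list)
    for step, belief in beliefs.items():
        groups[belief].append(step)

    advantages: Dict[Step, float] = {}
    for members in groups.values():
        values = [returns[s] for s in members]
        mu = mean(values)
        sigma = pstdev(values)  # 0.0 for a singleton group
        for s in members:
            advantages[s] = (returns[s] - mu) / (sigma + eps)
    return advantages


beliefs = {(0, 0): (1, 0), (1, 0): (1, 0), (0, 1): (0, 1)}
returns = {(0, 0): 1.0, (1, 0): 0.0, (0, 1): 1.0}
adv = group_advantages(beliefs, returns)
```

Singleton groups receive an advantage of zero under this normalization, so they contribute no learning signal. That is one reason the singleton ratio is a useful measure of grouping quality.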

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique may reduce reliance on human-written step-wise annotations when training agents for real-world tasks such as web navigation or household robotics.
Belief states could serve as a lightweight form of memory that helps agents maintain coherence across very long sessions without increasing context length.
If the belief predictor itself is trained jointly, the method might generalize to environments where the observation space changes over time.

Load-bearing premise

That discrepancies between an agent's predicted beliefs and later observed feedback can be turned into reliable dense training signals and that these structured belief states can adequately capture the relevant history in partially observable settings.

What would settle it

A controlled ablation on ALFWorld showing that removing the belief-consistency term and belief-aware grouping causes task success to drop back to the level of the episode-level GRPO baseline while holding all other training details fixed.
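One way to lay out such an ablation is a 2×2 grid over the two components, repeated across seeds. The sketch below only enumerates run configurations; `AblationConfig`, its field names, and the variant labels are hypothetical and do not correspond to options in the released code.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class AblationConfig:
    use_belief_consistency: bool  # toggle the belief-consistency term
    use_belief_grouping: bool     # toggle belief-aware grouping
    seed: int

    @property
    def name(self) -> str:
        if self.use_belief_consistency and self.use_belief_grouping:
            return "full"
        if self.use_belief_consistency:
            return "consistency_only"
        if self.use_belief_grouping:
            return "grouping_only"
        return "episode_level_baseline"


def ablation_grid(seeds: List[int]) -> List[AblationConfig]:
    """All four component combinations, each repeated for every seed."""
    return [
        AblationConfig(c, g, s)
        for c, g, s in product([True, False], [True, False], seeds)
    ]


runs = ablation_grid(seeds=[0, 1, 2])
```

Everything outside the two toggles (backbone, learning rate, rollout budget) would be held fixed across the grid so that differences can be attributed to the components.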

Figures

Figures reproduced from arXiv: 2605.20061 by Liquan Xiao, Minne Li, Sijie Huang, Wenjie Tang, Yuan Zhou.

**Figure 1.** Overview of ReBel. ReBel learns belief-aware policies for partially observable long-horizon tasks by making latent belief explicit and decomposing policy generation into belief, think, and action. It turns sparse terminal rewards into step-wise belief consistency feedback and performs belief-anchor grouping to support stable step-level advantage estimation. (S, A, Ω, T, O, R, γ), where S denotes the laten…

**Figure 2.** Training dynamics and per-task performance. (a) ALFWorld [33] training curves. REBEL reaches the final GRPO [32] performance after roughly 35 iterations, corresponding to an approximate 2.1× improvement in sample efficiency. (b) Per-task success rates on ALFWorld [33], sorted by estimated trajectory length; ∆ denotes the improvement of REBEL over GRPO [32]. (c) The gain of REBEL over GRPO [32] increases wi…

**Figure 3.** Grouping quality and training efficiency. (a) Singleton ratio for REBEL and GiGPO [11] during training. The average group size for GiGPO [11] is shown on the right axis. (b) Average episode length on ALFWorld [33]. REBEL reduces the average episode length from about 29.9 steps to 9.2 steps, a 3.2× reduction. (c) Success rate versus cumulative environment interactions. REBEL reaches 85% rollout success with…

**Figure 4.** ALFWorld prompt template. WebShop Prompt Template You are an expert autonomous agent operating in the WebShop e-commerce environment. Task: {task_description} Step count: {step_count} Recent history ({history_length} steps): {action_history} Current step: {current_step} Current observation: {current_observation} Use the previous belief state together with the current observation to update the belief, infer…

**Figure 5.** WebShop prompt template. C Limitations Our study has two main limitations. First, our experiments are conducted on two representative benchmarks and one backbone scale, 1.5B. This setting allows us to evaluate the core hypothesis of REBEL in a controlled and comparable manner, especially in environments where partial observability and intermediate reasoning play an important role. However, it does not full…

**Figure 6.** Belief drift as a failure mode in partially observable environments. The top row shows a Think-only agent that remains overconfident in an incorrect belief and repeatedly executes invalid actions. The bottom row shows a Belief-Augmented agent that updates its belief from observations, progressively reduces uncertainty, and succeeds in the task. REBEL aims to induce this belief-aware reasoning behavior duri…

Abstract

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
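For context on the episode-level baseline, the sketch below shows the group-relative advantage used by GRPO in its outcome-supervised form: each rollout's terminal reward is normalized against the other rollouts for the same task, and that single value is shared by every step of the rollout. The function names and the use of the population standard deviation are choices made for this illustration.

```python
from statistics import mean, pstdev
from typing import List


def grpo_episode_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: one normalized value per rollout."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_steps(advantage: float, num_steps: int) -> List[float]:
    """Every step of a rollout inherits the same episode-level advantage."""
    return [advantage] * num_steps


# Terminal rewards of four rollouts for one task (success = 1, failure = 0).
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_episode_advantages(rewards)
step_advs = broadcast_to_steps(advs[0], num_steps=5)
```

Because all five steps of the first rollout receive the same value, a helpful step and a wasted step are credited identically. This is the temporal credit assignment gap that process-level signals aim to close.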

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReBel gets some mileage out of belief-consistency signals for credit assignment in long-horizon LLM agents, but the abstract leaves the robustness of those signals untested.

ReBel's main idea is to track structured belief states over time in partially observable settings and convert mismatches between those beliefs and later observations into a dense self-supervised reward. This is meant to give better credit assignment than plain episode-level GRPO without needing step-by-step human labels or external verifiers. They also group trajectories by similar belief states to get lower-variance advantage estimates. On ALFWorld and WebShop the numbers look decent: up to 20.4 points higher success and 2.1 times better sample efficiency than the baseline.

Referee Report

2 major / 1 minor

Summary. The paper proposes ReBel, a process-level reinforcement learning algorithm for LLM agents operating in long-horizon partially observable environments. It explicitly models structured belief states to summarize interaction history, converts discrepancies between predicted beliefs and observed feedback into dense self-supervised consistency signals without external step-wise annotations, and applies belief-aware grouping to produce lower-variance advantage estimates. On ALFWorld and WebShop, ReBel is reported to improve task success by up to 20.4 percentage points over the episode-level GRPO baseline while achieving 2.1× better sample efficiency.

Significance. If the empirical gains prove robust, the approach would meaningfully advance credit assignment for agents in POMDPs by turning belief drift into a usable self-supervised signal. The public code release at https://github.com/Fateyetian/Rebel.git is a clear strength that supports reproducibility and future verification.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: the central numerical claims (20.4 pp success gain and 2.1× sample-efficiency improvement) are presented without error bars, number of random seeds, statistical significance tests, or ablation studies that isolate the contribution of belief-consistency supervision versus belief-aware grouping. These omissions make it impossible to determine whether the reported gains are stable or sensitive to unstated implementation choices.
[Method] Method section (belief-consistency supervision): the manuscript does not report separate belief-prediction accuracy metrics, human validation of extracted signals, or details on how discrepancies are quantified (e.g., similarity function or LLM-as-judge). In sparse-observation POMDPs this leaves open the possibility that noisy or self-referential consistency signals are being used, which directly undermines the claim that the supervision is reliable and annotation-free.
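The statistical reporting requested in the first comment could look roughly like the sketch below, which computes a per-seed mean and sample standard deviation and a paired t statistic against a baseline. The score lists are placeholders invented for the example and are not results from the paper. Converting the statistic to a p-value needs a t-distribution CDF from a statistics library, which is omitted here.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder success rates (%), one per seed. NOT results from the paper.
method_scores = [80.0, 82.0, 78.0, 81.0, 79.0]
baseline_scores = [60.0, 63.0, 59.0, 60.0, 58.0]

method_mean, method_std = summarize(method_scores)
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
```

Pairing by seed only makes sense if both methods share the seed-dependent factors, such as task sampling and initialization. Otherwise an unpaired test is the safer choice.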

minor comments (1)

Notation for belief states and the exact form of the consistency loss could be clarified with a short pseudocode or equation block to aid readers unfamiliar with the POMDP formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below and commit to revisions that will strengthen the statistical rigor and methodological transparency of the manuscript.

Referee: [Abstract and Experiments] Abstract and Experiments section: the central numerical claims (20.4 pp success gain and 2.1× sample-efficiency improvement) are presented without error bars, number of random seeds, statistical significance tests, or ablation studies that isolate the contribution of belief-consistency supervision versus belief-aware grouping. These omissions make it impossible to determine whether the reported gains are stable or sensitive to unstated implementation choices.

Authors: We agree that the current presentation of results lacks sufficient statistical characterization and component-wise analysis. In the revised manuscript we will report all main results as means and standard deviations over at least five independent random seeds, include error bars in all figures, perform statistical significance tests (e.g., paired t-tests with p-values) against the GRPO baseline, and add ablation studies that separately remove belief-consistency supervision and belief-aware grouping. These additions will allow readers to assess both stability and the individual contributions of each proposed component. revision: yes
Referee: [Method] Method section (belief-consistency supervision): the manuscript does not report separate belief-prediction accuracy metrics, human validation of extracted signals, or details on how discrepancies are quantified (e.g., similarity function or LLM-as-judge). In sparse-observation POMDPs this leaves open the possibility that noisy or self-referential consistency signals are being used, which directly undermines the claim that the supervision is reliable and annotation-free.

Authors: We acknowledge that additional implementation details and validation would improve transparency. In the revision we will expand the method section to include (i) quantitative belief-prediction accuracy on a held-out validation set of trajectories, (ii) the precise formulation used to quantify discrepancies between predicted beliefs and observed feedback, and (iii) qualitative examples of the resulting consistency signals. We will also report a small-scale human evaluation on a representative subset of signals to assess their quality. These changes preserve the annotation-free character of the approach while addressing concerns about potential noise. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external observation mismatches and benchmark evaluation

The abstract describes ReBel's belief-consistency supervision as converting discrepancies between predicted beliefs and observed feedback into self-supervised signals without external annotations or verifiers, with belief-aware grouping for advantage estimates. No equations, derivations, or self-citations are presented that reduce the claimed 20.4pp gains or 2.1× efficiency to fitted parameters, self-definitions, or prior author results by construction. The central claims rest on empirical results from ALFWorld and WebShop rather than tautological reductions, making the approach self-contained against external benchmarks as noted in the reader's assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the unproven domain assumption that belief states can be structured and predicted accurately enough to generate useful self-supervision signals; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Belief states can be explicitly modeled to summarize interaction history and guide policy learning in partially observable environments.
Role: central to converting belief-observation discrepancies into self-supervised signals without external verifiers.

pith-pipeline@v0.9.0 · 5766 in / 1264 out tokens · 46793 ms · 2026-05-20T05:32:02.766388+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals... belief-aware grouping to compare trajectories under similar belief states
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear


belief anchor b̃ ∈ {0,1}^K... GS(b̃) = {(i,t) | bi,t = b̃}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 14 internal anchors

[1] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024.
[2] Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems, 37:12461–12495, 2024.
[3] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
[4] Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, et al. Skyrl-agent: Efficient rl training for multi-turn llm agent. arXiv preprint arXiv:2511.16108, 2025.
[5] Bolin Chen, Ru-Ling Liao, Yan Ye, Jie Chen, Shanzhi Yin, Xinrui Ju, Shiqi Wang, and Yibo Fan. Sparse2dense: A keypoint-driven generative framework for human video compression and vertex prediction, 2025. URL https://arxiv.org/abs/2509.23169
[6] Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents. arXiv preprint arXiv:2502.01600, 2025.
[7] Youngbin Choi, Min Jae Lee, Saemi Moon, Seunghyuk Cho, C. Chung, Moon Jeong Park, and Dongwoo Kim. In-place feedback: A new paradigm for guiding llms in multi-turn reasoning. ArXiv, abs/2510.00777, 2025.
[8] Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. ArXiv, abs/2502.01456, 2025.
[9] Li Du, Zhouhao Sun, Xiao Ding, Yixuan Ma, Yang Zhao, Kaitao Qiu, Ting Liu, and Bing Qin. Causal-guided active learning for debiasing large language models. ArXiv, abs/2408.12942, 2024.
[10] Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making. ArXiv, abs/2506.17419, 2025.
[11] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training, 2025. URL https://arxiv.org/abs/2505.10978
[12] Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in rlhf, 2026. URL https://arxiv.org/abs/2502.18770
[13] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Sai Wang, Kun Zhang, Zhouchi Lin, Bowen Zhang, Li-Hua Ni, Wen yuan Gao, Yuanzhuo Wang, and Jian Guo. A survey on llm-as-a-judge. The Innovation, 2026.
[14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.
[15] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.
[16] Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023.
[17] Byunghyun Kim, Minyoung Bae, and Jae-Gil Lee. Sample-Efficient Multi-Round Generative Data Augmentation for Long-Tail Instance Segmentation. In Advances in Neural Information Processing Systems, 2025. URL https://mlanthology.org/neurips/2025/kim2025neurips-sampleefficient/
[18] Tulu 3: Pushing Frontiers in Open Language Model Post-Training. Nathan Lambert, Jacob Daniel Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and H... arXiv, 2024.
[19] Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, and Alane Suhr. Abbel: Llm agents acting through belief bottlenecks expressed in language. ArXiv, abs/2512.20111, 2025.
[20] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.
[21] Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems. Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianmi... arXiv, 2025.
[22] Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, and Jianbin Jiao. Agentic reinforcement learning with implicit step rewards. arXiv preprint arXiv:2509.19199, 2025.
[23] Gengchen Mai, Yingjie Hu, Song Gao, Ling Cai, Bruno Martins, Johannes Scholz, Jing Gao, and Krzysztof Janowicz. Symbolic and subsymbolic geoai: Geospatial knowledge graphs and spatially explicit machine learning. Trans. GIS, 26(8):3118–3124, 2022.
[24] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
[25] Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Xiaodong Song, and Sharon Li. Uncertainty quantification in llm agents: Foundations, emerging challenges, and opportunities. 2026.
[26] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[27] Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024.
[29] Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024.
[30] Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025.
[31] Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222–227, 1991.
[32] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[33] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.
[34] Yi Su et al. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025.
[35] Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. Monte carlo tree search: A review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497–2562, 2023.
[36] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
[37]

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Y .Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. ArXiv, abs/2312.08935, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning.arXiv preprint arXiv:2509.08755, 2025

Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, et al. Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning.arXiv preprint arXiv:2509.08755, 2025

work page arXiv 2025
[40]

Watch every step! llm agent learning via iterative step-level process refinement.arXiv preprint arXiv:2406.11176, 2024

Weimin Xiong, Yifan Song, Xiutian Chen, Hao Peng, Bryan Hooi, and Lexing Xie. Watch every step! llm agent learning via iterative step-level process refinement.arXiv preprint arXiv:2406.11176, 2024

work page arXiv 2024
[41]

Qwen2.5 Technical Report

Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin,...

[42]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

[43]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

[44]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023

[45]

ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. Archer: Training language model agents via hierarchical multi-turn rl. arXiv preprint arXiv:2402.19446, 2024

[46]

Enhancing Agentic RL with Progressive Reward Shaping and Value-Based Sampling Policy Optimization

Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu, and Xia Zeng. Enhancing agentic rl with progressive reward shaping and value-based sampling policy optimization, 2025. URL https://arxiv.org/abs/2512.07478

A Experimental Details

A.1 Computational Details

For both ALFWorld and WebShop, we conduct experiments on 4×A800 GPUs using Qwen2.5-1.5B-I...

