
arxiv: 2605.20061 · v1 · pith:KDEMV4O4 · submitted 2026-05-19 · 💻 cs.CL

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

Pith reviewed 2026-05-20 05:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords reinforcement learning · LLM agents · credit assignment · partial observability · belief states · long-horizon tasks · self-supervised signals · process-level RL

The pith

ReBel rewards belief consistency rather than actions to solve credit assignment in long-horizon LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReBel, a reinforcement learning method that models structured belief states to summarize an agent's interaction history in partially observable environments. It converts mismatches between these predicted beliefs and later observed feedback into dense self-supervised training signals, avoiding the need for external step-by-step labels. The method also groups trajectories that share similar belief states to produce lower-variance advantage estimates for policy updates. Experiments on ALFWorld and WebShop show gains of up to 20.4 percentage points in task success and 2.1 times better sample efficiency compared with episode-level baselines such as GRPO. A sympathetic reader would care because reliable credit assignment remains a core obstacle for scaling LLM agents on realistic, multi-step tasks where rewards arrive late and observations are incomplete.

Core claim

ReBel is a process-level reinforcement learning algorithm that maintains explicit structured belief states to summarize interaction history; it applies belief-consistency supervision by turning discrepancies between predicted beliefs and observed feedback into dense self-supervised signals, and it uses belief-aware grouping to compare trajectories under comparable belief states, producing more robust advantage estimates that improve policy learning on long-horizon tasks under partial observability.
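As a rough illustration of how belief-consistency supervision might produce dense signals, the Python sketch below scores each predicted belief against facts revealed by a later observation and adds the result to the sparse terminal reward. The `Belief` class, the key-value fact format, the exact-match comparison, and the `weight` parameter are all assumptions made for illustration; this page does not state how the paper quantifies discrepancies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    """Hypothetical structured belief: predicted facts about the hidden state."""
    facts: dict  # e.g. {"mug": "cabinet"}


def consistency_reward(belief, observed):
    """Score a belief against the facts a later observation reveals.

    Only facts that the observation actually reveals are checked. A belief
    that is never tested receives a neutral reward of 0.0.
    """
    checked = [k for k in belief.facts if k in observed]
    if not checked:
        return 0.0
    matches = sum(belief.facts[k] == observed[k] for k in checked)
    # Map the match fraction from [0, 1] to [-1, 1].
    return 2.0 * matches / len(checked) - 1.0


def dense_rewards(beliefs, observations, terminal_reward, weight=0.5):
    """Combine step-wise consistency signals with the sparse terminal reward."""
    rewards = [weight * consistency_reward(b, o)
               for b, o in zip(beliefs, observations)]
    if rewards:
        rewards[-1] += terminal_reward
    return rewards


beliefs = [Belief({"mug": "cabinet"}), Belief({"mug": "table"})]
observations = [{"mug": "table"}, {"mug": "table"}]
print(dense_rewards(beliefs, observations, terminal_reward=1.0))  # [-0.5, 1.5]
```

In this toy setup the first step is penalized because its belief is contradicted by the next observation, even though the episode eventually succeeds. That is the kind of step-level signal an episode-level reward cannot provide.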

What carries the argument

Belief-consistency supervision combined with belief-aware grouping, where structured belief states serve as the central summary of history and the source of dense self-supervised signals without external verifiers.

If this is right

  • Dense self-supervised signals derived from belief consistency can replace or augment sparse episode-level rewards in long-horizon settings.
  • Grouping trajectories by similar belief states yields lower-variance advantage estimates than grouping by episode identity alone.
  • If the gains come from the belief-modeling mechanism rather than benchmark specifics, the same approach should carry its credit-assignment improvements to other partially observable interactive benchmarks beyond ALFWorld and WebShop.
  • Process-level supervision of this form roughly doubles sample efficiency on the evaluated tasks (2.1× on the reported ALFWorld training curves).
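The grouping point above can be made concrete with a small Python sketch that normalizes returns within a group, in the style of group-relative advantage estimation. The sample format, the `key` function, and the use of exact string equality to decide that two belief states are "similar" are assumptions for illustration; the paper's actual grouping rule is not described on this page.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_advantages(samples, key, eps=1e-6):
    """Normalize each return against the samples that share its group key.

    samples: list of dicts with a "ret" entry (the return credited to a step).
    key: function mapping a sample to a hashable group identifier.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s["ret"])
    advantages = []
    for s in samples:
        rets = groups[key(s)]
        advantages.append((s["ret"] - mean(rets)) / (pstdev(rets) + eps))
    return advantages


# Toy data: two steps taken under each of two belief states.
samples = [
    {"belief": "mug in cabinet", "ret": 1.0},
    {"belief": "mug in cabinet", "ret": 0.0},
    {"belief": "mug on table", "ret": 1.0},
    {"belief": "mug on table", "ret": 1.0},
]

by_belief = group_advantages(samples, key=lambda s: s["belief"])
one_group = group_advantages(samples, key=lambda s: "all")
```

With belief grouping, the two "mug on table" steps get zero advantage because they are indistinguishable from their peers, while the "mug in cabinet" steps are contrasted only with each other. With a single global group, the "mug on table" steps receive positive credit simply because the other belief state contains a failure. Note that under this scheme a group with a single member always yields zero advantage, which is one reason the singleton ratio reported in Figure 3 matters.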

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may reduce reliance on human-written step-wise annotations when training agents for real-world tasks such as web navigation or household robotics.
  • Belief states could serve as a lightweight form of memory that helps agents maintain coherence across very long sessions without increasing context length.
  • If the belief predictor itself is trained jointly, the method might generalize to environments where the observation space changes over time.

Load-bearing premise

That discrepancies between an agent's predicted beliefs and later observed feedback can be turned into reliable dense training signals and that these structured belief states can adequately capture the relevant history in partially observable settings.

What would settle it

A controlled ablation on ALFWorld showing that removing the belief-consistency term and belief-aware grouping causes task success to drop back to the level of the episode-level GRPO baseline while holding all other training details fixed.
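One way to lay out such an ablation is a 2×2 grid over the two components, with everything else held fixed. The Python sketch below only enumerates the conditions; the flag names are hypothetical and do not correspond to options in the released code.

```python
from itertools import product

# Hypothetical flag names for the two components under test.
COMPONENTS = ("belief_consistency", "belief_grouping")


def ablation_grid():
    """Enumerate every on/off combination of the two components."""
    return [dict(zip(COMPONENTS, flags))
            for flags in product((True, False), repeat=len(COMPONENTS))]


def label(cfg):
    """Human-readable name for one ablation condition."""
    if all(cfg.values()):
        return "full ReBel"
    if not any(cfg.values()):
        return "both removed"
    return "without " + next(k for k, v in cfg.items() if not v)


for cfg in ablation_grid():
    print(label(cfg), cfg)
```

The "both removed" condition is the one the test above cares about: if its success rate matches the GRPO baseline, the gains can be attributed to the two components rather than to other training details.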

Figures

Figures reproduced from arXiv: 2605.20061 by Liquan Xiao, Minne Li, Sijie Huang, Wenjie Tang, Yuan Zhou.

Figure 1: Overview of ReBel. ReBel learns belief-aware policies for partially observable long-horizon tasks by making latent belief explicit and decomposing policy generation into belief, think, and action. It turns sparse terminal rewards into step-wise belief consistency feedback and performs belief-anchor grouping to support stable step-level advantage estimation. (S, A, Ω, T, O, R, γ), where S denotes the laten…
Figure 2: Training dynamics and per-task performance. (a) ALFWorld [33] training curves. REBEL reaches the final GRPO [32] performance after roughly 35 iterations, corresponding to an approximate 2.1× improvement in sample efficiency. (b) Per-task success rates on ALFWorld [33], sorted by estimated trajectory length; ∆ denotes the improvement of REBEL over GRPO [32]. (c) The gain of REBEL over GRPO [32] increases wi…
Figure 3: Grouping quality and training efficiency. (a) Singleton ratio for REBEL and GiGPO [11] during training. The average group size for GiGPO [11] is shown on the right axis. (b) Average episode length on ALFWorld [33]. REBEL reduces the average episode length from about 29.9 steps to 9.2 steps, a 3.2× reduction. (c) Success rate versus cumulative environment interactions. REBEL reaches 85% rollout success with…
Figure 4: ALFWorld prompt template. WebShop Prompt Template You are an expert autonomous agent operating in the WebShop e-commerce environment. Task: {task_description} Step count: {step_count} Recent history ({history_length} steps): {action_history} Current step: {current_step} Current observation: {current_observation} Use the previous belief state together with the current observation to update the belief, infer…
Figure 5: WebShop prompt template. C Limitations Our study has two main limitations. First, our experiments are conducted on two representative benchmarks and one backbone scale, 1.5B. This setting allows us to evaluate the core hypothesis of REBEL in a controlled and comparable manner, especially in environments where partial observability and intermediate reasoning play an important role. However, it does not full…
Figure 6: Belief drift as a failure mode in partially observable environments. The top row shows a Think-only agent that remains overconfident in an incorrect belief and repeatedly executes invalid actions. The bottom row shows a Belief-Augmented agent that updates its belief from observations, progressively reduces uncertainty, and succeeds in the task. REBEL aims to induce this belief-aware reasoning behavior duri…
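The WebShop prompt excerpt that appears alongside the Figure 4 caption uses named placeholders, so it can be filled with ordinary string formatting. The Python sketch below covers only the portion of the template visible on this page (the rest is truncated and is deliberately not reconstructed). The line breaks, the `render_prompt` helper, and the way `step_count` and `current_step` are derived from the history are assumptions for illustration.

```python
# Visible head of the WebShop template; the remainder is truncated on this page.
WEBSHOP_TEMPLATE_HEAD = (
    "You are an expert autonomous agent operating in the WebShop "
    "e-commerce environment.\n"
    "Task: {task_description}\n"
    "Step count: {step_count}\n"
    "Recent history ({history_length} steps): {action_history}\n"
    "Current step: {current_step}\n"
    "Current observation: {current_observation}\n"
)


def render_prompt(task, history, observation, history_length=2):
    """Fill the template head, keeping only the most recent actions."""
    # Guard against history[-0:], which would return the whole list.
    recent = history[-history_length:] if history_length > 0 else []
    return WEBSHOP_TEMPLATE_HEAD.format(
        task_description=task,
        step_count=len(history),          # assumed: number of completed steps
        history_length=len(recent),
        action_history="; ".join(recent) if recent else "none",
        current_step=len(history) + 1,    # assumed: 1-based index of this step
        current_observation=observation,
    )


print(render_prompt(
    "buy a red mug",
    ["search[red mug]", "click[item 1]", "click[back]"],
    "results page",
))
```

Truncating the history to a fixed window keeps the prompt length bounded, which is where an explicit belief state would have to carry the information that older steps no longer supply.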
Original abstract

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ReBel, a process-level reinforcement learning algorithm for LLM agents operating in long-horizon partially observable environments. It explicitly models structured belief states to summarize interaction history, converts discrepancies between predicted beliefs and observed feedback into dense self-supervised consistency signals without external step-wise annotations, and applies belief-aware grouping to produce lower-variance advantage estimates. On ALFWorld and WebShop, ReBel is reported to improve task success by up to 20.4 percentage points over the episode-level GRPO baseline while achieving 2.1× better sample efficiency.

Significance. If the empirical gains prove robust, the approach would meaningfully advance credit assignment for agents in POMDPs by turning belief drift into a usable self-supervised signal. The public code release at https://github.com/Fateyetian/Rebel.git is a clear strength that supports reproducibility and future verification.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: the central numerical claims (20.4 pp success gain and 2.1× sample-efficiency improvement) are presented without error bars, number of random seeds, statistical significance tests, or ablation studies that isolate the contribution of belief-consistency supervision versus belief-aware grouping. These omissions make it impossible to determine whether the reported gains are stable or sensitive to unstated implementation choices.
  2. [Method] Method section (belief-consistency supervision): the manuscript does not report separate belief-prediction accuracy metrics, human validation of extracted signals, or details on how discrepancies are quantified (e.g., similarity function or LLM-as-judge). In sparse-observation POMDPs this leaves open the possibility that noisy or self-referential consistency signals are being used, which directly undermines the claim that the supervision is reliable and annotation-free.
minor comments (1)
  1. Notation for belief states and the exact form of the consistency loss could be clarified with a short pseudocode or equation block to aid readers unfamiliar with the POMDP formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below and commit to revisions that will strengthen the statistical rigor and methodological transparency of the manuscript.

  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the central numerical claims (20.4 pp success gain and 2.1× sample-efficiency improvement) are presented without error bars, number of random seeds, statistical significance tests, or ablation studies that isolate the contribution of belief-consistency supervision versus belief-aware grouping. These omissions make it impossible to determine whether the reported gains are stable or sensitive to unstated implementation choices.

    Authors: We agree that the current presentation of results lacks sufficient statistical characterization and component-wise analysis. In the revised manuscript we will report all main results as means and standard deviations over at least five independent random seeds, include error bars in all figures, perform statistical significance tests (e.g., paired t-tests with p-values) against the GRPO baseline, and add ablation studies that separately remove belief-consistency supervision and belief-aware grouping. These additions will allow readers to assess both stability and the individual contributions of each proposed component. revision: yes

  2. Referee: [Method] Method section (belief-consistency supervision): the manuscript does not report separate belief-prediction accuracy metrics, human validation of extracted signals, or details on how discrepancies are quantified (e.g., similarity function or LLM-as-judge). In sparse-observation POMDPs this leaves open the possibility that noisy or self-referential consistency signals are being used, which directly undermines the claim that the supervision is reliable and annotation-free.

    Authors: We acknowledge that additional implementation details and validation would improve transparency. In the revision we will expand the method section to include (i) quantitative belief-prediction accuracy on a held-out validation set of trajectories, (ii) the precise formulation used to quantify discrepancies between predicted beliefs and observed feedback, and (iii) qualitative examples of the resulting consistency signals. We will also report a small-scale human evaluation on a representative subset of signals to assess their quality. These changes preserve the annotation-free character of the approach while addressing concerns about potential noise. revision: yes
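The statistical reporting promised in the first response (means, standard deviations, and a paired test over seeds) can be sketched with the Python standard library alone. The success rates below are made-up placeholder numbers, not results from the paper, and pairing runs by seed is an assumption about how the comparison would be set up.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(scores), stdev(scores)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder success rates for five seeds (illustrative only).
method = [0.80, 0.85, 0.78, 0.82, 0.84]
baseline = [0.62, 0.66, 0.60, 0.65, 0.63]

t_stat, dof = paired_t(method, baseline)
print(summarize(method), summarize(baseline), t_stat, dof)
```

With five seeds there are four degrees of freedom, so a two-sided test at the 0.05 level requires |t| above roughly 2.776. For a p-value, `scipy.stats.ttest_rel` performs the same paired test if SciPy is available.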

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external observation mismatches and benchmark evaluation


The abstract describes ReBel's belief-consistency supervision as converting discrepancies between predicted beliefs and observed feedback into self-supervised signals without external annotations or verifiers, with belief-aware grouping for advantage estimates. No equations, derivations, or self-citations are presented that reduce the claimed 20.4pp gains or 2.1× efficiency to fitted parameters, self-definitions, or prior author results by construction. The central claims rest on empirical results from ALFWorld and WebShop rather than tautological reductions, making the approach self-contained against external benchmarks as noted in the reader's assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the unproven domain assumption that belief states can be structured and predicted accurately enough to generate useful self-supervision signals; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Belief states can be explicitly modeled to summarize interaction history and guide policy learning in partially observable environments.
    Central to converting belief-observation discrepancies into self-supervised signals without external verifiers.




Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 14 internal anchors

  1. [1]

    Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

  2. [2]

    Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning

    Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems, 37:12461–12495, 2024

  3. [3]

    Exploration by Random Network Distillation

    Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018

  4. [4]

    Skyrl-agent: Efficient rl training for multi-turn llm agent

    Shiyi Cao, Dacheng Li, Fangzhou Zhao, Shuo Yuan, Sumanth R Hegde, Connor Chen, Charlie Ruan, Tyler Griggs, Shu Liu, Eric Tang, et al. Skyrl-agent: Efficient rl training for multi-turn llm agent. arXiv preprint arXiv:2511.16108, 2025

  5. [5]

    Sparse2dense: A keypoint-driven generative framework for human video compression and vertex prediction

    Bolin Chen, Ru-Ling Liao, Yan Ye, Jie Chen, Shanzhi Yin, Xinrui Ju, Shiqi Wang, and Yibo Fan. Sparse2dense: A keypoint-driven generative framework for human video compression and vertex prediction, 2025. URL https://arxiv.org/abs/2509.23169

  6. [6]

    Reinforcement learning for long-horizon interactive llm agents

    Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, Vladlen Koltun, and Philipp Krähenbühl. Reinforcement learning for long-horizon interactive llm agents. arXiv preprint arXiv:2502.01600, 2025

  7. [7]

    In-place feedback: A new paradigm for guiding llms in multi-turn reasoning

    Youngbin Choi, Min Jae Lee, Saemi Moon, Seunghyuk Cho, C. Chung, Moon Jeong Park, and Dongwoo Kim. In-place feedback: A new paradigm for guiding llms in multi-turn reasoning. ArXiv, abs/2510.00777, 2025

  8. [8]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. ArXiv, abs/2502.01456, 2025

  9. [9]

    Causal-guided active learning for debiasing large language models

    Li Du, Zhouhao Sun, Xiao Ding, Yixuan Ma, Yang Zhao, Kaitao Qiu, Ting Liu, and Bing Qin. Causal-guided active learning for debiasing large language models. ArXiv, abs/2408.12942, 2024

  10. [10]

    Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making

    Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making. ArXiv, abs/2506.17419, 2025

  11. [11]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training, 2025. URL https://arxiv.org/abs/2505.10978

  12. [12]

    Reward shaping to mitigate reward hacking in rlhf

    Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward shaping to mitigate reward hacking in rlhf, 2026. URL https://arxiv.org/abs/2502.18770

  13. [13]

    A survey on llm-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Sai Wang, Kun Zhang, Zhouchi Lin, Bowen Zhang, Li-Hua Ni, Wen yuan Gao, Yuanzhuo Wang, and Jian Guo. A survey on llm-as-a-judge. The Innovation, 2026

  14. [14]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  15. [15]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.

  16. [16]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8154–8173, 2023

  17. [17]

    Sample-Efficient Multi-Round Generative Data Augmentation for Long-Tail Instance Segmentation

    Byunghyun Kim, Minyoung Bae, and Jae-Gil Lee. Sample-Efficient Multi-Round Generative Data Augmentation for Long-Tail Instance Segmentation. In Advances in Neural Information Processing Systems, 2025. URL https://mlanthology.org/neurips/2025/kim2025neurips-sampleefficient/

  18. [18]

    Tulu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Daniel Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and H...

  19. [19]

    Abbel: Llm agents acting through belief bottlenecks expressed in language

    Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, and Alane Suhr. Abbel: Llm agents acting through belief bottlenecks expressed in language. ArXiv, abs/2512.20111, 2025

  20. [20]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023

  21. [21]

    Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

    Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianmi...

  22. [22]

    Agentic reinforcement learning with implicit step rewards

    Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, and Jianbin Jiao. Agentic reinforcement learning with implicit step rewards. arXiv preprint arXiv:2509.19199, 2025

  23. [23]

    Symbolic and subsymbolic geoai: Geospatial knowledge graphs and spatially explicit machine learning

    Gengchen Mai, Yingjie Hu, Song Gao, Ling Cai, Bruno Martins, Johannes Scholz, Jing Gao, and Krzysztof Janowicz. Symbolic and subsymbolic geoai: Geospatial knowledge graphs and spatially explicit machine learning. Trans. GIS, 26(8):3118–3124, 2022

  24. [24]

    Augmented Language Models: a Survey

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

  25. [25]

    Uncertainty quantification in llm agents: Foundations, emerging challenges, and opportunities

    Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Xiaodong Song, and Sharon Li. Uncertainty quantification in llm agents: Foundations, emerging challenges, and opportunities. 2026

  26. [26]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  27. [27]

    Agent q: Advanced reasoning and learning for autonomous ai agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024

  28. [29]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

  29. [30]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958, 2025

  30. [31]

    A possibility for implementing curiosity and boredom in model-building neural controllers

    Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pages 222–227, 1991

  31. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  32. [33]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020

  33. [34]

    Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains

    Yi Su et al. Crossing the reward bridge: Expanding RL with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829, 2025

  34. [35]

    Monte carlo tree search: A review of recent modifications and applications

    Maciej Świechowski, Konrad Godlewski, Bartosz Sawicki, and Jacek Mańdziuk. Monte carlo tree search: A review of recent modifications and applications. Artificial Intelligence Review, 56(3):2497–2562, 2023

  35. [36]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022

  36. [37]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. ArXiv, abs/2312.08935, 2023

  37. [38]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025

  38. [39]

    Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning

    Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, et al. Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning. arXiv preprint arXiv:2509.08755, 2025

  39. [40]

    Watch every step! llm agent learning via iterative step-level process refinement

    Weimin Xiong, Yifan Song, Xiutian Chen, Hao Peng, Bryan Hooi, and Lexing Xie. Watch every step! llm agent learning via iterative step-level process refinement. arXiv preprint arXiv:2406.11176, 2024

  40. [41]

    Qwen2.5 Technical Report

    Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin,...

  41. [42]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  42. [43]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.

  43. [44]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023

  44. [45]

    Archer: Training language model agents via hierarchical multi-turn rl

    Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. Archer: Training language model agents via hierarchical multi-turn rl. arXiv preprint arXiv:2402.19446, 2024

  45. [46]

    Enhancing agentic rl with progressive reward shaping and value-based sampling policy optimization

    Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu, and Xia Zeng. Enhancing agentic rl with progressive reward shaping and value-based sampling policy optimization, 2025. URL https://arxiv.org/abs/2512.07478.

    A Experimental Details. A.1 Computational Details. For both ALFWorld and WebShop, we conduct experiments on 4×A800 GPUs using Qwen2.5-1.5B-I...