Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents
Pith reviewed 2026-05-20 05:32 UTC · model grok-4.3
The pith
ReBel rewards belief consistency rather than actions to solve credit assignment in long-horizon LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReBel is a process-level reinforcement learning algorithm that maintains explicit structured belief states to summarize interaction history. It applies belief-consistency supervision, turning discrepancies between predicted beliefs and observed feedback into dense self-supervised signals, and it uses belief-aware grouping to compare trajectories under comparable belief states. The resulting advantage estimates are more robust, which improves policy learning on long-horizon tasks under partial observability.
What carries the argument
Belief-consistency supervision combined with belief-aware grouping, where structured belief states serve as the central summary of history and the source of dense self-supervised signals without external verifiers.
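A minimal sketch of how a belief-consistency signal of this kind could be computed. It assumes beliefs are binary vectors (the page quotes a belief anchor in {0,1}^K further down) and that environment feedback can be mapped onto the same K slots, with `None` marking slots the feedback did not reveal. The function name `consistency_reward` and the agreement-fraction scoring are hypothetical illustrations, not the paper's formulation.

```python
from typing import Optional, Sequence


def consistency_reward(predicted: Sequence[int],
                       observed: Sequence[Optional[int]]) -> float:
    """Fraction of observed belief slots that the prediction got right.

    Slots the environment did not reveal (None) are skipped, so the signal
    only scores what the feedback can confirm or refute. Returns 0.0 when
    nothing was observed (an assumed convention: no evidence, no reward).
    """
    if len(predicted) != len(observed):
        raise ValueError("belief vectors must have the same length")
    checked = [(p, o) for p, o in zip(predicted, observed) if o is not None]
    if not checked:
        return 0.0
    agree = sum(1 for p, o in checked if p == o)
    return agree / len(checked)


# Hypothetical step: the agent predicted 4 binary facts; feedback reveals 3.
print(consistency_reward([1, 0, 1, 1], [1, None, 0, 1]))
```

Because the score is available at every step where feedback arrives, it is dense in the sense the page describes, and it needs no external verifier; whether the real signal is this simple is not stated here.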
If this is right
- Dense self-supervised signals derived from belief consistency can replace or augment sparse episode-level rewards in long-horizon settings.
- Grouping trajectories by similar belief states yields lower-variance advantage estimates than grouping by episode identity alone.
- The same belief-modeling approach extends credit assignment improvements to other partially observable interactive benchmarks beyond ALFWorld and WebShop.
- Process-level supervision of this form roughly doubles sample efficiency on the evaluated tasks (the abstract reports 2.1×).
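The grouping claim in the list above can be illustrated with a GRPO-style normalization in which the group is defined by a shared belief state rather than by the episode. This is a sketch under stated assumptions: `grouped_advantages`, the `(belief_key, return)` sample layout, and the `eps` stabilizer are hypothetical names and choices, and the paper's actual estimator may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-8):
    """Normalize each return against the group that shares its belief key.

    samples: list of (belief_key, return) pairs, where belief_key is any
    hashable summary of the belief state (for example a tuple of 0/1).
    Returns one advantage per sample, in input order.
    """
    groups = defaultdict(list)
    for key, ret in samples:
        groups[key].append(ret)
    # Per-group mean and population standard deviation.
    stats = {k: (mean(v), pstdev(v)) for k, v in groups.items()}
    return [(ret - stats[key][0]) / (stats[key][1] + eps)
            for key, ret in samples]


# Toy data: two belief states, two samples each (not results from the paper).
toy = [((1, 0), 1.0), ((1, 0), 0.0), ((0, 1), 0.5), ((0, 1), 0.5)]
print(grouped_advantages(toy))
```

Samples are only compared with others collected under the same belief, so differences in return are less likely to reflect differences in what the agent knew at that point.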
Where Pith is reading between the lines
- The technique may reduce reliance on human-written step-wise annotations when training agents for real-world tasks such as web navigation or household robotics.
- Belief states could serve as a lightweight form of memory that helps agents maintain coherence across very long sessions without increasing context length.
- If the belief predictor itself is trained jointly, the method might generalize to environments where the observation space changes over time.
Load-bearing premise
That discrepancies between an agent's predicted beliefs and later observed feedback can be turned into reliable dense training signals and that these structured belief states can adequately capture the relevant history in partially observable settings.
What would settle it
A controlled ablation on ALFWorld showing that removing the belief-consistency term and belief-aware grouping causes task success to drop back to the level of the episode-level GRPO baseline while holding all other training details fixed.
Original abstract
Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
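The abstract does not say how the sample-efficiency multiplier is measured. One common reading, assumed here, is the ratio of training steps each method needs to first reach the same target success rate. The sketch below uses hypothetical helper names and toy learning curves that are not taken from the paper.

```python
from typing import Optional, Sequence, Tuple

Curve = Sequence[Tuple[int, float]]  # (training step, success rate), sorted by step


def steps_to_reach(curve: Curve, target: float) -> Optional[int]:
    """First logged step at which the success rate meets the target."""
    for step, success in curve:
        if success >= target:
            return step
    return None


def efficiency_ratio(baseline: Curve, method: Curve,
                     target: float) -> Optional[float]:
    """Baseline steps divided by method steps; None if either never gets there."""
    b = steps_to_reach(baseline, target)
    m = steps_to_reach(method, target)
    if b is None or m is None:
        return None
    return b / m


# Toy curves for illustration only.
baseline_curve = [(100, 0.2), (200, 0.4), (300, 0.6)]
method_curve = [(100, 0.4), (150, 0.6)]
print(efficiency_ratio(baseline_curve, method_curve, target=0.6))
```

Under this definition the ratio depends on the chosen target and on logging granularity, which is one reason a stated measurement protocol matters for a figure like 2.1×.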
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReBel, a process-level reinforcement learning algorithm for LLM agents operating in long-horizon partially observable environments. It explicitly models structured belief states to summarize interaction history, converts discrepancies between predicted beliefs and observed feedback into dense self-supervised consistency signals without external step-wise annotations, and applies belief-aware grouping to produce lower-variance advantage estimates. On ALFWorld and WebShop, ReBel is reported to improve task success by up to 20.4 percentage points over the episode-level GRPO baseline while achieving 2.1× better sample efficiency.
Significance. If the empirical gains prove robust, the approach would meaningfully advance credit assignment for agents in POMDPs by turning belief drift into a usable self-supervised signal. The public code release at https://github.com/Fateyetian/Rebel.git is a clear strength that supports reproducibility and future verification.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: the central numerical claims (20.4 pp success gain and 2.1× sample-efficiency improvement) are presented without error bars, number of random seeds, statistical significance tests, or ablation studies that isolate the contribution of belief-consistency supervision versus belief-aware grouping. These omissions make it impossible to determine whether the reported gains are stable or sensitive to unstated implementation choices.
- [Method] Method section (belief-consistency supervision): the manuscript does not report separate belief-prediction accuracy metrics, human validation of extracted signals, or details on how discrepancies are quantified (e.g., similarity function or LLM-as-judge). In sparse-observation POMDPs this leaves open the possibility that noisy or self-referential consistency signals are being used, which directly undermines the claim that the supervision is reliable and annotation-free.
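The seed-level reporting requested in the first major comment could look roughly like the following. It is a standard-library sketch with hypothetical function names and toy per-seed scores. It computes the mean, sample standard deviation, and the paired t statistic; turning that statistic into a p-value requires the Student t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across random seeds."""
    return mean(scores), stdev(scores)


def paired_t_statistic(method, baseline):
    """Paired t statistic and degrees of freedom for per-seed score pairs.

    Entry k of both lists must come from the same seed.
    """
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need at least two matched seed pairs")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Toy per-seed success rates, not results from the paper.
method_scores = [0.6, 0.7, 0.8]
baseline_scores = [0.5, 0.5, 0.5]
print(summarize(method_scores))
print(paired_t_statistic(method_scores, baseline_scores))
```

If SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.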
minor comments (1)
- Notation for belief states and the exact form of the consistency loss could be clarified with a short pseudocode or equation block to aid readers unfamiliar with the POMDP formulation.
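As an illustration of what such an equation block might contain, here is one hypothetical form. It assumes a predicted binary belief vector $b_t \in \{0,1\}^K$, an observation-derived target $\hat{b}_{t+1} \in \{0,1\}^K$, and a mask $m_{t+1} \in \{0,1\}^K$ marking which slots the feedback reveals. All three symbols and the reward $r^{\mathrm{bel}}_t$ are assumptions for the sketch; the paper's actual consistency loss is not reproduced on this page.

```latex
\[
  r^{\mathrm{bel}}_t
  = \frac{\sum_{k=1}^{K} m_{t+1,k}\,\mathbb{1}\!\left[\, b_{t,k} = \hat{b}_{t+1,k} \,\right]}
         {\max\!\left(1,\ \sum_{k=1}^{K} m_{t+1,k}\right)}
\]
```

Under these assumptions the reward is the fraction of revealed slots on which prediction and feedback agree, and it is zero when nothing is revealed.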
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below and commit to revisions that will strengthen the statistical rigor and methodological transparency of the manuscript.
- Referee: [Abstract and Experiments] Abstract and Experiments section: the central numerical claims (20.4 pp success gain and 2.1× sample-efficiency improvement) are presented without error bars, number of random seeds, statistical significance tests, or ablation studies that isolate the contribution of belief-consistency supervision versus belief-aware grouping. These omissions make it impossible to determine whether the reported gains are stable or sensitive to unstated implementation choices.
  Authors: We agree that the current presentation of results lacks sufficient statistical characterization and component-wise analysis. In the revised manuscript we will report all main results as means and standard deviations over at least five independent random seeds, include error bars in all figures, perform statistical significance tests (e.g., paired t-tests with p-values) against the GRPO baseline, and add ablation studies that separately remove belief-consistency supervision and belief-aware grouping. These additions will allow readers to assess both stability and the individual contributions of each proposed component.
  Revision: yes
- Referee: [Method] Method section (belief-consistency supervision): the manuscript does not report separate belief-prediction accuracy metrics, human validation of extracted signals, or details on how discrepancies are quantified (e.g., similarity function or LLM-as-judge). In sparse-observation POMDPs this leaves open the possibility that noisy or self-referential consistency signals are being used, which directly undermines the claim that the supervision is reliable and annotation-free.
  Authors: We acknowledge that additional implementation details and validation would improve transparency. In the revision we will expand the method section to include (i) quantitative belief-prediction accuracy on a held-out validation set of trajectories, (ii) the precise formulation used to quantify discrepancies between predicted beliefs and observed feedback, and (iii) qualitative examples of the resulting consistency signals. We will also report a small-scale human evaluation on a representative subset of signals to assess their quality. These changes preserve the annotation-free character of the approach while addressing concerns about potential noise.
  Revision: yes
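For the small-scale human evaluation mentioned in the second response, one conventional way to quantify agreement between human judgments and the automatic consistency signals is Cohen's kappa. The rebuttal does not name a metric, so this choice is an assumption, and the labels below are toy values.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(1 for x, y in zip(labels_a, labels_b) if x == y) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels: 1 = "signal is correct", 0 = "signal is wrong".
human = [1, 1, 0, 0]
automatic = [1, 0, 0, 0]
print(cohens_kappa(human, automatic))
```

Kappa is preferable to raw agreement when one label dominates, since a signal that always says "consistent" can score high raw agreement without being informative.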
Circularity Check
No significant circularity; derivation relies on external observation mismatches and benchmark evaluation
The abstract describes ReBel's belief-consistency supervision as converting discrepancies between predicted beliefs and observed feedback into self-supervised signals without external annotations or verifiers, with belief-aware grouping for advantage estimates. No equations, derivations, or self-citations are presented that reduce the claimed 20.4pp gains or 2.1× efficiency to fitted parameters, self-definitions, or prior author results by construction. The central claims rest on empirical results from ALFWorld and WebShop rather than tautological reductions, making the approach self-contained against external benchmarks as noted in the reader's assessment.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Belief states can be explicitly modeled to summarize interaction history and guide policy learning in partially observable environments.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals... belief-aware grouping to compare trajectories under similar belief states
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: belief anchor $\tilde{b} \in \{0,1\}^K$... $G^S(\tilde{b}) = \{(i,t) \mid b_{i,t} = \tilde{b}\}$
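The quoted set definition groups every (trajectory, step) index pair whose belief equals a given anchor. A minimal sketch of building that index, assuming beliefs are stored as `beliefs[i][t]`, a length-K sequence of 0/1 values for trajectory i at step t; the storage layout and the name `build_belief_groups` are hypothetical.

```python
from collections import defaultdict


def build_belief_groups(beliefs):
    """Map each distinct belief vector to the set of (i, t) pairs that hold it.

    beliefs[i][t] is the belief of trajectory i at step t. Vectors are
    converted to tuples so they can be used as dictionary keys.
    """
    groups = defaultdict(set)
    for i, trajectory in enumerate(beliefs):
        for t, belief in enumerate(trajectory):
            groups[tuple(belief)].add((i, t))
    return dict(groups)


# Two toy trajectories with K = 2 belief slots each.
toy_beliefs = [
    [(0, 0), (1, 0)],
    [(1, 0), (1, 1)],
]
groups = build_belief_groups(toy_beliefs)
print(groups[(1, 0)])
```

Exact matching on binary vectors keeps the grouping cheap, but groups can become very small as K grows; how the paper handles sparsely populated groups is not stated on this page.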
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Experimental Details
A.1 Computational Details
For both ALFWorld and WebShop, we conduct experiments on 4×A800 GPUs using Qwen2.5-1.5B-I...