Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Anthony L. Caterini; Brendan Leigh Ross; Dae Yon Hwang; Jesse C. Cresswell; No\"el Vouitsis; Raunaq Suri; Valentin Villecroze

arxiv: 2606.05296 · v1 · pith:N64AJMSCnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

Dae Yon Hwang , Raunaq Suri , Valentin Villecroze , Anthony L. Caterini , Jesse C. Cresswell , No\"el Vouitsis , Brendan Leigh Ross This is my paper

Pith reviewed 2026-06-28 07:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords black-box LLM agentssequential Monte Carloreinforcement learningtest-time optimizationAgentGym benchmarkvalue functionoptimal policy sampling

0 comments

The pith

Black-box LLM agents can be optimized by sampling from the optimal policy using sequential Monte Carlo at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the equivalence between reinforcement learning and Bayesian inference lets one optimize black-box agents without touching their parameters. The fixed LLM defines a prior over trajectories, and sequential Monte Carlo draws samples from the resulting posterior by using a learned value function to steer toward higher-reward paths. This matters because state-of-the-art models are often available only through APIs. Experiments on three AgentGym environments show gains over prompting baselines and scaling advantages over Group Relative Policy Optimization as more test-time samples are used.

Core claim

Agentic Monte Carlo demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents by sampling from the optimal policy using Sequential Monte Carlo. The optimal policy is treated as a posterior over trajectories whose prior is the fixed black-box LLM agent. A value function is learned to steer the sampling process while leaving the underlying black-box model unchanged, yielding measurable improvements on AgentGym tasks.

What carries the argument

Sequential Monte Carlo sampling from the posterior over trajectories, with a learned value function that steers the black-box LLM without changing its parameters.

If this is right

Significant improvements over prompting baselines on three diverse AgentGym environments.
Outperforms Group Relative Policy Optimization when test-time compute is scaled up.
The black-box LLM remains fixed while optimization occurs entirely through guided sampling.
Principled RL-style optimization becomes possible for API-only agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sampling approach could extend to other black-box decision systems that admit a prior-over-trajectories view.
Better value-function approximators might reduce the number of samples needed to reach a given performance level.
Observed scaling with test-time compute suggests similar Monte Carlo steering could apply to non-LLM agents.

Load-bearing premise

The equivalence between RL and Bayesian inference holds for black-box agents, so the optimal policy can be treated as a posterior whose prior is the fixed LLM.

What would settle it

Performance on AgentGym tasks stays flat or declines as the Monte Carlo sample budget increases, or the learned value function shows no advantage over uniform random steering of trajectories.

Figures

Figures reproduced from arXiv: 2606.05296 by Anthony L. Caterini, Brendan Leigh Ross, Dae Yon Hwang, Jesse C. Cresswell, No\"el Vouitsis, Raunaq Suri, Valentin Villecroze.

**Figure 1.** Figure 1: Comparison between AMC and other agentic paradigms. AMC facilitates task-specificity for black-box LLM policies using a learned lightweight value function. On the other hand, training a white-box model with RL imposes constraints on the choice and scale of the base policy. agents, Reinforcement Learning (RL) (Sutton et al., 1998; Christiano et al., 2017; Ziegler et al., 2019) has become the dominant traini… view at source ↗

**Figure 2.** Figure 2: Visual representation of AMC for N = 3 trajectories. Importance weights w (i) t are determined using the value function Vθ, where lower-weighted trajectories (e.g., s (1)) are more likely to be pruned than higher-weighted ones (s (2), s(3)) during resampling. These results position AMC with black-box models as a viable alternative to gradient-based RL in GPUconstrained settings. 2. Agentic Monte Carlo Th… view at source ↗

**Figure 3.** Figure 3: Comparisons to GRPO on SciWorld. Left: AMC and Best-of-N with a GPT-5.1 policy and a Qwen-2.5-7B-based value function, compared to GRPO with a Qwen-2.5-7B backbone (highlighting the advantage of using AMC with black-box models). Right: AMC and Best-of-N with a Qwen-2.5-3B policy and value function, compared to GRPO with the same backbone [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Best-of-N vs. AMC across tasks. Left: WebShop with Llama-3.2-11B policy, value function. Middle: SciWorld with Llama-3.1-8B policy, value function. Right: TextCraft with Llama-3.2-11B policy, value function [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: WebShop performance across different resampling steps. 4 8 12 16 4,8 4,12 4,16 8,12 8,16 12,16 4,8,12 4,8,16 4,12,16 8,12,16 4,8,12,16 Resampling Step Configurations 0.180 0.210 0.240 0.270 Average Score 1-Step Resampling 2-Step Resampling 3-Step Resampling 4-Step Resampling [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

**Figure 6.** Figure 6: SciWorld performance across different resampling steps. Our ablation study on resampling step configurations with 5 trajectories ( [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: TextCraft performance across different resampling steps. ( PN i=1( ˜w (i) t ) 2 ) −1 < N ρ for a given threshold ρ ∈ (0, 1). While ESS is designed to mitigate trajectory degeneracy by filtering low-weight samples, [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of agent trajectories in WebShop. We highlight the divergence at the resampling step 6, where the value function differentiates between promising and non-promising ones. Case 1: Promising (Success) Goal: Find me hand crafted candy & chocolate for gift set, valentine day with style: happy birthday truffle chocolate, and price lower than 60.00 dollars. [STEP 1] Action: search[hand crafted candy gi… view at source ↗

read the original abstract

LLM agents operate in two distinct regimes: open-weight agents amenable to reinforcement learning (RL) and black-box agents whose behaviour must be controlled purely at test time. Although black-box agents are often backed by state-of-the-art proprietary LLMs, API-only access precludes parameter-level optimization, rendering most RL methods inapplicable. To address this limitation, we turn to a known equivalence between RL and Bayesian inference. We propose Agentic Monte Carlo (AMC) to directly sample from the optimal policy of a black-box agent rather than training it through RL. The optimal policy is a posterior over trajectories whose prior we define as the fixed black-box LLM agent. We employ Sequential Monte Carlo to sample from this posterior by learning a value function to steer the agent while leaving the underlying black-box model unchanged. We validate AMC on three diverse environments from the AgentGym benchmark, demonstrating significant improvements over prompting baselines and even outperforming Group Relative Policy Optimization (GRPO) as we scale the test-time compute of our method. AMC demonstrates the feasibility of performing principled RL-style optimization of black-box LLM agents. Code is available at https://github.com/layer6ai-labs/Agentic-Monte-Carlo

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AMC reframes black-box agent optimization as SMC sampling from the RL posterior with a learned value function, showing empirical gains but leaving the exactness of the target distribution unclear.

read the letter

Colleague,

The key takeaway is that this paper adapts Sequential Monte Carlo to sample optimal trajectories for black-box LLM agents by casting the problem as posterior inference with the LLM as prior and using a learned value function for steering.

What the work does well is demonstrate concrete improvements on three AgentGym environments. The method beats basic prompting and, when test-time compute is scaled, surpasses GRPO. Releasing the code is a plus for reproducibility. The framing leverages the known RL-as-Bayesian-inference equivalence in a setting where standard RL doesn't apply, which is a reasonable extension.

The soft spots are around the implementation of the steering. The stress-test concern is on point: for the samples to come from the true posterior p(τ | r), the value function needs to be derived in a way that maintains the correct importance weights. Since the LLM is black-box, accessing the necessary probabilities for exact weighting might require approximations. The abstract does not include the derivation or any analysis of bias, so it's hard to tell how close the method stays to the claimed optimal policy. If the full paper has a section showing that the learned V comes from the soft Bellman equation and that the resampling preserves the measure, that would resolve it. Otherwise the gains might come from a heuristic search rather than exact sampling.

This is relevant for anyone working on agentic systems with proprietary models. Readers interested in test-time compute scaling or SMC applications in sequential decision making would get something out of the experiments.

I would send it for peer review. The idea addresses a practical gap and the results are strong enough to justify checking the technical details.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Agentic Monte Carlo (AMC) to enable RL-style optimization for black-box LLM agents. It defines the optimal policy as the posterior p(τ | r) ∝ p_LLM(τ) ⋅ exp(r(τ)) with the fixed black-box LLM as prior, then uses Sequential Monte Carlo (SMC) guided by a learned value function V to sample from this posterior without modifying the underlying model. Experiments on three AgentGym environments report improvements over prompting baselines and GRPO when scaling test-time compute.

Significance. If the central construction is sound, AMC would allow principled optimization of proprietary black-box agents, a practically important capability. The code release aids reproducibility. The result's significance hinges on whether the value-function steering exactly preserves the target posterior measure.

major comments (2)

[§3] §3 (Method): the claim that SMC steered by the learned V samples exactly from the optimal posterior requires a derivation showing that the proposals and importance weights remain unbiased when the black-box LLM supplies no log p_LLM(a|s). The manuscript provides no such derivation or explicit importance-weight formula, which is load-bearing for the 'principled' and 'exact posterior' claims.
[§4] §4 (Experiments): the reported gains over GRPO are presented without error bars, ablation on the value-function training procedure, or analysis of how many particles are required for stable weights. This weakens the scaling claim that additional test-time compute reliably improves performance.

minor comments (2)

[Abstract / §4.1] The abstract states results on 'three diverse environments' but does not name them; the main text should list the specific AgentGym tasks in §4.1.
[§2 / §3] Notation for the value function V and the reward r(τ) should be introduced with explicit dependence on the trajectory before being used in the posterior definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our method and experiments. We address each major point below and commit to revisions that strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (Method): the claim that SMC steered by the learned V samples exactly from the optimal posterior requires a derivation showing that the proposals and importance weights remain unbiased when the black-box LLM supplies no log p_LLM(a|s). The manuscript provides no such derivation or explicit importance-weight formula, which is load-bearing for the 'principled' and 'exact posterior' claims.

Authors: We agree that an explicit derivation is required to support the exactness claim. Because the black-box LLM serves as both the prior and the proposal distribution, the prior density terms cancel in the importance weights, leaving weights that depend only on the reward and the value-function correction. We will add a dedicated subsection (and appendix) in the revision that derives the incremental weight formula step by step and proves that the resulting SMC estimator remains unbiased for the target posterior p(τ | r). revision: yes
Referee: [§4] §4 (Experiments): the reported gains over GRPO are presented without error bars, ablation on the value-function training procedure, or analysis of how many particles are required for stable weights. This weakens the scaling claim that additional test-time compute reliably improves performance.

Authors: The referee correctly identifies gaps in the experimental reporting. In the revised version we will (i) report means and standard errors over at least five independent runs, (ii) include an ablation varying the value-function architecture and training data, and (iii) plot performance versus number of particles together with effective sample size to demonstrate when weights remain stable. These additions will directly support the test-time scaling claims. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation applies external equivalence without self-referential reduction

full rationale

The paper defines the target distribution via the standard control-as-inference equivalence (optimal policy as posterior p(τ|r) ∝ p_LLM(τ) exp(r(τ))) and proposes SMC sampling steered by a separately learned value function. This does not reduce any claimed result to a fitted input by construction, nor does it rely on self-citation for the uniqueness or validity of the equivalence. The value function is presented as an auxiliary component whose effect on importance weights is left as an empirical question rather than asserted by definitional fiat. The central claim therefore remains independent of its own outputs and is self-contained against the cited external literature on RL-Bayesian equivalence.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the RL-Bayesian equivalence (treated as background) and the learnability of a value function that can steer sampling without model access; one free parameter class is introduced via the value function.

free parameters (1)

value function parameters
Parameters of the learned value function that steers the SMC sampling process.

axioms (1)

domain assumption Equivalence between reinforcement learning and Bayesian inference
Invoked to define the optimal policy as a posterior over trajectories with the black-box agent as prior.

pith-pipeline@v0.9.1-grok · 5768 in / 1347 out tokens · 21562 ms · 2026-06-28T07:33:19.785063+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 10 canonical work pages

[1]

An Introduction to Sequential Monte Carlo Methods

Doucet, Arnaud and de Freitas, Nando and Gordon, Neil. An Introduction to Sequential Monte Carlo Methods. Sequential Monte Carlo Methods in Practice. 2001. doi:10.1007/978-1-4757-3437-9_1

work page doi:10.1007/978-1-4757-3437-9_1 2001
[2]

2025 , eprint=

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems , author=. 2025 , eprint=

2025
[3]

Reinforcement learning: An introduction , author=
[4]

Neural Computation , volume=

Using expectation-maximization for reinforcement learning , author=. Neural Computation , volume=. 1997 , publisher=

1997
[5]

2006 , isbn =

Toussaint, Marc and Storkey, Amos , title =. 2006 , isbn =. doi:10.1145/1143844.1143963 , booktitle =

work page doi:10.1145/1143844.1143963 2006
[6]

The Fourteenth International Conference on Learning Representations , year=

Brendan Leigh Ross and No. The Fourteenth International Conference on Learning Representations , year=
[7]

Feng, Naihe and Sui, Yi and Hou, Shiyi and Wu, Ga and Cresswell, Jesse C , journal=
[8]

Vasileios Lioutas and Jonathan Wilder Lavington and Justice Sefas and Matthew Niedoba and Yunpeng Liu and Berend Zwartsenberg and Setareh Dabiri and Frank Wood and Adam Scibior , booktitle=
[9]

2025 , eprint=

SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems , author=. 2025 , eprint=

2025
[10]

Lew, Alexander K and Zhi-Xuan, Tan and Grand, Gabriel and Mansinghka, Vikash , booktitle=
[11]

2025 , eprint=

PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving , author=. 2025 , eprint=

2025
[12]

2025 , eprint=

On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents , author=. 2025 , eprint=

2025
[13]

arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv:1707.06347 , year=

Pith/arXiv arXiv
[14]

Deep Reinforcement Learning from Human Preferences , volume =

Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario , booktitle =. Deep Reinforcement Learning from Human Preferences , volume =
[15]

2025 , month = nov, url =

2025
[16]

Training language models to follow instructions with human feedback , volume =

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...
[17]

2025 , month = aug, url =

2025
[18]

arXiv:1909.08593 , year=

Fine-tuning language models from human preferences , author=. arXiv:1909.08593 , year=

Pith/arXiv arXiv 1909
[19]

2024 , eprint=

AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems , author=. 2024 , eprint=

2024
[20]

IEE proceedings F (Radar and Signal Processing) , volume=

Novel approach to nonlinear/non-Gaussian Bayesian state estimation , author=. IEE proceedings F (Radar and Signal Processing) , volume=
[21]

Machine Learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine Learning , volume=. 1992 , publisher=

1992
[22]

Puri, Isha and Sudalairaj, Shivchander and Xu, Guangxuan and Bhandwaldar, Abhishek and Xu, Kai and Srivastava, Akash , booktitle =
[23]

arXiv:2502.05986 , year=

Preventing rogue agents improves multi-agent collaboration , author=. arXiv:2502.05986 , year=

arXiv
[24]

Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik R Narasimhan and Yuan Cao , booktitle=
[25]

Hunter Lightman and Vineet Kosaraju and Yuri Burda and Harrison Edwards and Bowen Baker and Teddy Lee and Jan Leike and John Schulman and Ilya Sutskever and Karl Cobbe , booktitle=
[26]

Xiao Yu and Baolin Peng and Vineeth Vajipey and Hao Cheng and Michel Galley and Jianfeng Gao and Zhou Yu , booktitle=. Ex
[27]

arXiv:1712.01815 , year=

Mastering chess and shogi by self-play with a general reinforcement learning algorithm , author=. arXiv:1712.01815 , year=

Pith/arXiv arXiv
[28]

Proceedings of the 41st International Conference on Machine Learning , pages=

Language agent tree search unifies reasoning, acting, and planning in language models , author=. Proceedings of the 41st International Conference on Machine Learning , pages=
[29]

International Conference on Learning Representations , year=

Alexandre Pich\'. International Conference on Learning Representations , year=
[30]

Levine, Sergey , journal=
[31]

Zhiheng Xi and Jixuan Huang and Chenyang Liao and Baodai Huang and Honglin Guo and Jiaqi Liu and Rui Zheng and Junjie Ye and Jiazheng Zhang and Wenxiang Chen and Wei He and Yiwen Ding and Guanyu Li and Zehui Chen and Zhengyin Du and Xuesong Yao and Yufei Xu and Jiecao Chen and Tao Gui and Zuxuan Wu and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang , journal=
[32]

A gent G ym: Evaluating and Training Large Language Model-based Agents across Diverse Environments

Xi, Zhiheng and Ding, Yiwen and Chen, Wenxiang and Hong, Boyang and Guo, Honglin and Wang, Junzhe and Guo, Xin and Yang, Dingwen and Liao, Chenyang and He, Wei and Gao, Songyang and Chen, Lu and Zheng, Rui and Zou, Yicheng and Gui, Tao and Zhang, Qi and Qiu, Xipeng and Huang, Xuanjing and Wu, Zuxuan and Jiang, Yu-Gang. A gent G ym: Evaluating and Training...

work page doi:10.18653/v1/2025.acl-long.1355 2025
[33]

Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Xiao Bi and Haowei Zhang and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo , journal=
[34]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo
[35]

ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Re- flection

Kim, Jeonghye and Rhee, Sojeong and Kim, Minbeom and Kim, Dohyung and Lee, Sangmook and Sung, Youngchul and Jung, Kyomin. Refl A ct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1697

work page doi:10.18653/v1/2025.emnlp-main.1697 2025
[36]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Fleet of Agents: Coordinated Problem Solving with Large Language Models , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , volume =

2025
[37]

2020 , publisher=

Chopin, Nicolas and Papaspiliopoulos, Omiros and others , volume=. 2020 , publisher=

2020
[38]

arXiv:1503.02531 , year=

Distilling the knowledge in a neural network , author=. arXiv:1503.02531 , year=

Pith/arXiv arXiv
[39]

and Harada, Daishi and Russell, Stuart J

Ng, Andrew Y. and Harada, Daishi and Russell, Stuart J. , title =. Proceedings of the Sixteenth International Conference on Machine Learning , pages =. 1999 , isbn =

1999
[40]

Sutton , title =

Richard S. Sutton , title =. Machine Learning , year =. doi:10.1007/BF00115009 , issn =

work page doi:10.1007/bf00115009
[41]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , volume =

Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik , booktitle =. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , volume =
[42]

Science W orld: Is your Agent Smarter than a 5th Grader?

Wang, Ruoyao and Jansen, Peter and C \^o t \'e , Marc-Alexandre and Ammanabrolu, Prithviraj. Science W orld: Is your Agent Smarter than a 5th Grader?. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.775

work page doi:10.18653/v1/2022.emnlp-main.775 2022
[43]

ADa PT : As-Needed Decomposition and Planning with Language Models

Prasad, Archiki and Koller, Alexander and Hartmann, Mareike and Clark, Peter and Sabharwal, Ashish and Bansal, Mohit and Khot, Tushar. ADa PT : As-Needed Decomposition and Planning with Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.264

work page doi:10.18653/v1/2024.findings-naacl.264 2024
[44]

RL with KL penalties is better viewed as B ayesian inference

Korbak, Tomasz and Perez, Ethan and Buckley, Christopher. RL with KL penalties is better viewed as B ayesian inference. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. doi:10.18653/v1/2022.findings-emnlp.77

work page doi:10.18653/v1/2022.findings-emnlp.77 2022
[45]

Probabilistic Inference in Language Models via Twisted Sequential

Zhao, Stephen and Brekelmans, Rob and Makhzani, Alireza and Grosse, Roger Baker , booktitle =. Probabilistic Inference in Language Models via Twisted Sequential. 2024 , volume =

2024
[46]

The Thirteenth International Conference on Learning Representations , year=

Jo. The Thirteenth International Conference on Learning Representations , year=
[47]

Wang, Yiping and Yang, Qing and Zeng, Zhiyuan and Ren, Liliang and Liu, Liyuan and Peng, Baolin and Cheng, Hao and He, Xuehai and Wang, Kuan and Gao, Jianfeng and Chen, Weizhu and Wang, Shuohang and Du, Simon and shen, yelong , booktitle =
[48]

Pennino, Federico and Raimondi, Bianca and Rondelli, Massimo and Gurioli, Andrea and Gabbrielli, Maurizio , journal=
[49]

2025 , eprint=

Scaling Autonomous Agents via Automatic Reward Modeling And Planning , author=. 2025 , eprint=

2025
[50]

Toolformer: Language Models Can Teach Themselves to Use Tools , volume =

Schick, Timo and Dwivedi-Yu, Jane and Dessi, Roberto and Raileanu, Roberta and Lomeli, Maria and Hambro, Eric and Zettlemoyer, Luke and Cancedda, Nicola and Scialom, Thomas , booktitle =. Toolformer: Language Models Can Teach Themselves to Use Tools , volume =
[51]

An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mi...
[52]

Aaron Grattafiori and others , journal=
[53]

2025 , howpublished=

2025
[54]

Proceedings of the 34th International Conference on Machine Learning , pages =

Reinforcement Learning with Deep Energy-Based Policies , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , volume =

2017
[55]

Proceedings of the 35th International Conference on Machine Learning , pages =

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , author =. Proceedings of the 35th International Conference on Machine Learning , pages =. 2018 , volume =

2018
[56]

doi:10.52202/079017-2365 , pages =

Ma, Chang and Zhang, Junlei and Zhu, Zhihao and Yang, Cheng and Yang, Yujiu and Jin, Yaohui and Lan, Zhenzhong and Kong, Lingpeng and He, Junxian , booktitle =. doi:10.52202/079017-2365 , pages =

work page doi:10.52202/079017-2365
[57]

Statistics and Computing , year =

Arnaud Doucet and Simon Godsill and Christophe Andrieu , title =. Statistics and Computing , year =. doi:10.1023/A:1008935410038 , issn =

work page doi:10.1023/a:1008935410038
[58]

2004 , publisher=

Feynman-Kac formulae: genealogical and interacting particle systems with applications , author=. 2004 , publisher=

2004
[59]

Signal Processing Group, Department of Engineering, University of Cambridge, Technical Report CUEDIF-INFENGrrR38 , volume=

Convergence of sequential Monte Carlo methods , author=. Signal Processing Group, Department of Engineering, University of Cambridge, Technical Report CUEDIF-INFENGrrR38 , volume=
[60]

The Annals of Statistics , volume=

Finite-sample complexity of sequential Monte Carlo estimators , author=. The Annals of Statistics , volume=. 2023 , publisher=

2023
[61]

International Conference on Machine Learning , pages=

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment , author=. International Conference on Machine Learning , pages=. 2025 , organization=

2025

[1] [1]

An Introduction to Sequential Monte Carlo Methods

Doucet, Arnaud and de Freitas, Nando and Gordon, Neil. An Introduction to Sequential Monte Carlo Methods. Sequential Monte Carlo Methods in Practice. 2001. doi:10.1007/978-1-4757-3437-9_1

work page doi:10.1007/978-1-4757-3437-9_1 2001

[2] [2]

2025 , eprint=

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems , author=. 2025 , eprint=

2025

[3] [3]

Reinforcement learning: An introduction , author=

[4] [4]

Neural Computation , volume=

Using expectation-maximization for reinforcement learning , author=. Neural Computation , volume=. 1997 , publisher=

1997

[5] [5]

2006 , isbn =

Toussaint, Marc and Storkey, Amos , title =. 2006 , isbn =. doi:10.1145/1143844.1143963 , booktitle =

work page doi:10.1145/1143844.1143963 2006

[6] [6]

The Fourteenth International Conference on Learning Representations , year=

Brendan Leigh Ross and No. The Fourteenth International Conference on Learning Representations , year=

[7] [7]

Feng, Naihe and Sui, Yi and Hou, Shiyi and Wu, Ga and Cresswell, Jesse C , journal=

[8] [8]

Vasileios Lioutas and Jonathan Wilder Lavington and Justice Sefas and Matthew Niedoba and Yunpeng Liu and Berend Zwartsenberg and Setareh Dabiri and Frank Wood and Adam Scibior , booktitle=

[9] [9]

2025 , eprint=

SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems , author=. 2025 , eprint=

2025

[10] [10]

Lew, Alexander K and Zhi-Xuan, Tan and Grand, Gabriel and Mansinghka, Vikash , booktitle=

[11] [11]

2025 , eprint=

PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving , author=. 2025 , eprint=

2025

[12] [12]

2025 , eprint=

On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents , author=. 2025 , eprint=

2025

[13] [13]

arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv:1707.06347 , year=

Pith/arXiv arXiv

[14] [14]

Deep Reinforcement Learning from Human Preferences , volume =

Christiano, Paul F and Leike, Jan and Brown, Tom and Martic, Miljan and Legg, Shane and Amodei, Dario , booktitle =. Deep Reinforcement Learning from Human Preferences , volume =

[15] [15]

2025 , month = nov, url =

2025

[16] [16]

Training language models to follow instructions with human feedback , volume =

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...

[17] [17]

2025 , month = aug, url =

2025

[18] [18]

arXiv:1909.08593 , year=

Fine-tuning language models from human preferences , author=. arXiv:1909.08593 , year=

Pith/arXiv arXiv 1909

[19] [19]

2024 , eprint=

AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems , author=. 2024 , eprint=

2024

[20] [20]

IEE proceedings F (Radar and Signal Processing) , volume=

Novel approach to nonlinear/non-Gaussian Bayesian state estimation , author=. IEE proceedings F (Radar and Signal Processing) , volume=

[21] [21]

Machine Learning , volume=

Simple statistical gradient-following algorithms for connectionist reinforcement learning , author=. Machine Learning , volume=. 1992 , publisher=

1992

[22] [22]

Puri, Isha and Sudalairaj, Shivchander and Xu, Guangxuan and Bhandwaldar, Abhishek and Xu, Kai and Srivastava, Akash , booktitle =

[23] [23]

arXiv:2502.05986 , year=

Preventing rogue agents improves multi-agent collaboration , author=. arXiv:2502.05986 , year=

arXiv

[24] [24]

Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik R Narasimhan and Yuan Cao , booktitle=

[25] [25]

Hunter Lightman and Vineet Kosaraju and Yuri Burda and Harrison Edwards and Bowen Baker and Teddy Lee and Jan Leike and John Schulman and Ilya Sutskever and Karl Cobbe , booktitle=

[26] [26]

Xiao Yu and Baolin Peng and Vineeth Vajipey and Hao Cheng and Michel Galley and Jianfeng Gao and Zhou Yu , booktitle=. Ex

[27] [27]

arXiv:1712.01815 , year=

Mastering chess and shogi by self-play with a general reinforcement learning algorithm , author=. arXiv:1712.01815 , year=

Pith/arXiv arXiv

[28] [28]

Proceedings of the 41st International Conference on Machine Learning , pages=

Language agent tree search unifies reasoning, acting, and planning in language models , author=. Proceedings of the 41st International Conference on Machine Learning , pages=

[29] [29]

International Conference on Learning Representations , year=

Alexandre Pich\'. International Conference on Learning Representations , year=

[30] [30]

Levine, Sergey , journal=

[31] [31]

Zhiheng Xi and Jixuan Huang and Chenyang Liao and Baodai Huang and Honglin Guo and Jiaqi Liu and Rui Zheng and Junjie Ye and Jiazheng Zhang and Wenxiang Chen and Wei He and Yiwen Ding and Guanyu Li and Zehui Chen and Zhengyin Du and Xuesong Yao and Yufei Xu and Jiecao Chen and Tao Gui and Zuxuan Wu and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang , journal=

[32] [32]

A gent G ym: Evaluating and Training Large Language Model-based Agents across Diverse Environments

Xi, Zhiheng and Ding, Yiwen and Chen, Wenxiang and Hong, Boyang and Guo, Honglin and Wang, Junzhe and Guo, Xin and Yang, Dingwen and Liao, Chenyang and He, Wei and Gao, Songyang and Chen, Lu and Zheng, Rui and Zou, Yicheng and Gui, Tao and Zhang, Qi and Qiu, Xipeng and Huang, Xuanjing and Wu, Zuxuan and Jiang, Yu-Gang. A gent G ym: Evaluating and Training...

work page doi:10.18653/v1/2025.acl-long.1355 2025

[33] [33]

Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Xiao Bi and Haowei Zhang and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo , journal=

[34] [34]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo

[35] [35]

ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Re- flection

Kim, Jeonghye and Rhee, Sojeong and Kim, Minbeom and Kim, Dohyung and Lee, Sangmook and Sung, Youngchul and Jung, Kyomin. Refl A ct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1697

work page doi:10.18653/v1/2025.emnlp-main.1697 2025

[36] [36]

Proceedings of the 42nd International Conference on Machine Learning , pages =

Fleet of Agents: Coordinated Problem Solving with Large Language Models , author =. Proceedings of the 42nd International Conference on Machine Learning , pages =. 2025 , volume =

2025

[37] [37]

2020 , publisher=

Chopin, Nicolas and Papaspiliopoulos, Omiros and others , volume=. 2020 , publisher=

2020

[38] [38]

arXiv:1503.02531 , year=

Distilling the knowledge in a neural network , author=. arXiv:1503.02531 , year=

Pith/arXiv arXiv

[39] [39]

and Harada, Daishi and Russell, Stuart J

Ng, Andrew Y. and Harada, Daishi and Russell, Stuart J. , title =. Proceedings of the Sixteenth International Conference on Machine Learning , pages =. 1999 , isbn =

1999

[40] [40]

Sutton , title =

Richard S. Sutton , title =. Machine Learning , year =. doi:10.1007/BF00115009 , issn =

work page doi:10.1007/bf00115009

[41] [41]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , volume =

Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik , booktitle =. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , volume =

[42] [42]

Science W orld: Is your Agent Smarter than a 5th Grader?

Wang, Ruoyao and Jansen, Peter and C \^o t \'e , Marc-Alexandre and Ammanabrolu, Prithviraj. Science W orld: Is your Agent Smarter than a 5th Grader?. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.775

work page doi:10.18653/v1/2022.emnlp-main.775 2022

[43] [43]

ADa PT : As-Needed Decomposition and Planning with Language Models

Prasad, Archiki and Koller, Alexander and Hartmann, Mareike and Clark, Peter and Sabharwal, Ashish and Bansal, Mohit and Khot, Tushar. ADa PT : As-Needed Decomposition and Planning with Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.264

work page doi:10.18653/v1/2024.findings-naacl.264 2024

[44] [44]

RL with KL penalties is better viewed as B ayesian inference

Korbak, Tomasz and Perez, Ethan and Buckley, Christopher. RL with KL penalties is better viewed as B ayesian inference. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. doi:10.18653/v1/2022.findings-emnlp.77

work page doi:10.18653/v1/2022.findings-emnlp.77 2022

[45] [45]

Probabilistic Inference in Language Models via Twisted Sequential

Zhao, Stephen and Brekelmans, Rob and Makhzani, Alireza and Grosse, Roger Baker , booktitle =. Probabilistic Inference in Language Models via Twisted Sequential. 2024 , volume =

2024

[46] [46]

The Thirteenth International Conference on Learning Representations , year=

Jo. The Thirteenth International Conference on Learning Representations , year=

[47] [47]

Wang, Yiping and Yang, Qing and Zeng, Zhiyuan and Ren, Liliang and Liu, Liyuan and Peng, Baolin and Cheng, Hao and He, Xuehai and Wang, Kuan and Gao, Jianfeng and Chen, Weizhu and Wang, Shuohang and Du, Simon and shen, yelong , booktitle =

[48] [48]

Pennino, Federico and Raimondi, Bianca and Rondelli, Massimo and Gurioli, Andrea and Gabbrielli, Maurizio , journal=

[49] [49]

2025 , eprint=

Scaling Autonomous Agents via Automatic Reward Modeling And Planning , author=. 2025 , eprint=

2025

[50] [50]

Toolformer: Language Models Can Teach Themselves to Use Tools , volume =

Schick, Timo and Dwivedi-Yu, Jane and Dessi, Roberto and Raileanu, Roberta and Lomeli, Maria and Hambro, Eric and Zettlemoyer, Luke and Cancedda, Nicola and Scialom, Thomas , booktitle =. Toolformer: Language Models Can Teach Themselves to Use Tools , volume =

[51] [51]

An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mi...

[52] [52]

Aaron Grattafiori and others , journal=

[53] [53]

2025 , howpublished=

2025

[54] [54]

Proceedings of the 34th International Conference on Machine Learning , pages =

Reinforcement Learning with Deep Energy-Based Policies , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , volume =

2017

[55] [55]

Proceedings of the 35th International Conference on Machine Learning , pages =

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , author =. Proceedings of the 35th International Conference on Machine Learning , pages =. 2018 , volume =

2018

[56] [56]

doi:10.52202/079017-2365 , pages =

Ma, Chang and Zhang, Junlei and Zhu, Zhihao and Yang, Cheng and Yang, Yujiu and Jin, Yaohui and Lan, Zhenzhong and Kong, Lingpeng and He, Junxian , booktitle =. doi:10.52202/079017-2365 , pages =

work page doi:10.52202/079017-2365

[57] [57]

Statistics and Computing , year =

Arnaud Doucet and Simon Godsill and Christophe Andrieu , title =. Statistics and Computing , year =. doi:10.1023/A:1008935410038 , issn =

work page doi:10.1023/a:1008935410038

[58] [58]

2004 , publisher=

Feynman-Kac formulae: genealogical and interacting particle systems with applications , author=. 2004 , publisher=

2004

[59] [59]

Signal Processing Group, Department of Engineering, University of Cambridge, Technical Report CUEDIF-INFENGrrR38 , volume=

Convergence of sequential Monte Carlo methods , author=. Signal Processing Group, Department of Engineering, University of Cambridge, Technical Report CUEDIF-INFENGrrR38 , volume=

[60] [60]

The Annals of Statistics , volume=

Finite-sample complexity of sequential Monte Carlo estimators , author=. The Annals of Statistics , volume=. 2023 , publisher=

2023

[61] [61]

International Conference on Machine Learning , pages=

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment , author=. International Conference on Machine Learning , pages=. 2025 , organization=

2025