pith. machine review for the scientific record.

arxiv: 2605.08315 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM policy optimization · trajectory feedback · salience bias · reflective prompting · reinforcement learning · policy search · agent trajectories · multi-environment evaluation

The pith

A Critic-LLM that inspects full agent trajectories proposes targeted revisions that let policies reach higher rewards faster and more stably than scalar-reward baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing LLM-based policy optimizers receive only a scalar reward and therefore cannot distinguish whether an agent looped, fell into a hole, or failed on one rollout out of twenty. R2PO adds a second stage in which a Critic-LLM receives the actual sequences of states, actions, and rewards and outputs concrete parameter changes grounded in those observations. Across ten environments this two-stage process produces the highest mean best reward and reaches near-maximum performance in substantially fewer episodes than either deep RL or earlier LLM methods. The work also isolates salience bias, the tendency of the Critic-LLM to over-focus on a single poor trajectory even when most succeed, and shows that aggregate statistics plus median selection largely eliminate the resulting regressions.

Core claim

R2PO is a two-stage framework in which a Search-LLM proposes candidate policy parameters, the environment executes them to generate rollouts, and a Critic-LLM examines those rollouts to propose targeted revisions grounded in observed states, actions, and rewards. Ablations establish that separating global search from behavior-grounded revision and filtering high-variance edits are both necessary for the observed gains. The framework identifies salience bias as a dominant failure mode in which the Critic-LLM fixates on improving a single failure even when most trajectories succeed; this accounts for 76.6 percent of regressions in a three-trajectory CartPole variant. With a 20B open-weight LLM, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (near-maximum CartPole reward within roughly 500 episodes), and trains more stably than both deep RL and prior LLM-based methods.
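To make the control flow concrete, here is a minimal Python sketch of one R2PO iteration. The helper names search_llm_propose, critic_llm_revise, and evaluate are hypothetical placeholders for the paper's Search-LLM, Critic-LLM, and environment rollout machinery; the sketch illustrates the structure described above, not the authors' implementation.

```python
# Minimal sketch of one R2PO iteration (illustrative only).
# search_llm_propose, critic_llm_revise, and evaluate are hypothetical
# placeholders for the Search-LLM, Critic-LLM, and environment rollouts.

def r2po_step(history, env, num_rollouts=20):
    # Stage 1: the Search-LLM proposes candidate parameters from the
    # reward-only replay history of (theta, mean_reward) pairs.
    theta = search_llm_propose(history)
    mean_reward, trajectories = evaluate(env, theta, num_rollouts)

    # Stage 2: the Critic-LLM inspects full trajectories (states, actions,
    # rewards) and proposes a targeted revision grounded in that evidence.
    theta_revised = critic_llm_revise(theta, trajectories)
    revised_reward, _ = evaluate(env, theta_revised, num_rollouts)

    # Selection filters harmful or high-variance edits: keep the revision
    # only if it does not regress relative to the original proposal.
    if revised_reward >= mean_reward:
        theta, mean_reward = theta_revised, revised_reward

    history.append((theta, mean_reward))
    return theta, mean_reward
```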

What carries the argument

The Critic-LLM that receives full execution trajectories and generates behavior-grounded policy revisions, together with aggregate rollout statistics and median-trajectory selection to counteract salience bias.
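A rough sketch of how such trajectory evidence might be assembled is given below, assuming each rollout is a list of (state, action, reward) steps. The aggregate statistics and median-trajectory choice follow the mitigation described above; the paper's actual prompt construction is not reproduced here.

```python
import statistics

def build_trajectory_evidence(rollouts):
    """Summarize K rollouts into aggregate statistics plus one representative
    (median-return) trajectory. Assumes each rollout is a list of
    (state, action, reward) steps; illustrative, not the paper's code."""
    returns = [sum(reward for _, _, reward in rollout) for rollout in rollouts]
    lengths = [len(rollout) for rollout in rollouts]

    # Aggregate statistics let the Critic-LLM judge whether failures are
    # systematic or occasional, rather than fixating on the worst rollout.
    stats = {
        "mean_return": statistics.mean(returns),
        "std_return": statistics.pstdev(returns),
        "min_return": min(returns),
        "max_return": max(returns),
        "mean_length": statistics.mean(lengths),
    }

    # Median-trajectory selection: hand the Critic-LLM the typical rollout
    # instead of an outlier failure.
    median_index = sorted(range(len(returns)), key=returns.__getitem__)[len(returns) // 2]
    return stats, rollouts[median_index]
```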

If this is right

  • Separating policy proposal from trajectory-based critique is required to obtain reliable gains over scalar-reward LLM methods.
  • Using aggregate statistics and median selection reduces performance regressions caused by salience bias.
  • Near-optimal performance is reached in far fewer episodes when behavioral evidence is supplied instead of compressed rewards.
  • Training stability improves when high-variance edits are filtered before the policy is updated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The salience-bias finding indicates that multi-example prompting in variable-outcome settings requires explicit mechanisms to avoid outlier fixation.
  • Trajectory-grounded revision may extend to other sequential decision tasks where execution traces are available as in-context evidence.
  • The performance gap over scalar methods suggests that in-context behavioral feedback can serve as a lightweight alternative to gradient updates for certain policy classes.

Load-bearing premise

The Critic-LLM can produce targeted revisions from trajectories that reliably improve the policy rather than introducing new errors or high-variance changes.

What would settle it

An ablation in which the Critic-LLM receives only scalar rewards and no trajectory details fails to produce faster convergence or higher final rewards than scalar-only baselines on the same ten environments.
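One way such an ablation could be wired up, sketched with the same hypothetical helpers as above; critic_llm_revise_scalar stands in for a critic prompt that contains only reward summaries, while the Search-LLM and selection rule are held identical across conditions.

```python
# Sketch of the settling ablation: identical search and selection, two critic
# conditions. All helper functions are hypothetical placeholders; the scalar
# variant sees only reward summaries, never full trajectories.

def run_condition(env, num_episodes, use_trajectories, num_rollouts=20):
    history, best = [], float("-inf")
    for _ in range(num_episodes):
        theta = search_llm_propose(history)
        mean_reward, trajectories = evaluate(env, theta, num_rollouts)

        if use_trajectories:
            theta_rev = critic_llm_revise(theta, trajectories)        # full evidence
        else:
            theta_rev = critic_llm_revise_scalar(theta, mean_reward)  # scalar only

        rev_reward, _ = evaluate(env, theta_rev, num_rollouts)
        if rev_reward >= mean_reward:
            theta, mean_reward = theta_rev, rev_reward

        history.append((theta, mean_reward))
        best = max(best, mean_reward)
    return best

# If the scalar-only condition matches the full-trajectory condition on final
# reward and convergence speed, the trajectory-grounding claim would not hold.
```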

Figures

Figures reproduced from arXiv: 2605.08315 by Claudio Zito, Rahaf Abu Hara, Vaibbhav Murarri.

Figure 1
Figure 1. The two-stage R2PO framework. The Search-LLM proposes candidate parameters θ_init conditioned on a reward-only replay history {(θ_i, R̄_i)}_{i=1}^{t−1} of previously evaluated parameters and their mean rewards. The environment evaluates θ_init over K rollouts, returning the mean reward and rollout trajectories from which Trajectory Evidence (orange, dashed) is constructed. The Critic-LLM uses this evidence to propose a targeted, behavior-grounded revision.
Figure 2
Figure 2. Learning curves (mean ± standard deviation over 10 independent runs) for R2PO, ProPS, ProPS+, and the best SB3 baseline on four representative environments. R2PO reaches strong performance earlier and maintains it more consistently than other baselines. Curves for the remaining six environments are in Appendix G and show the same overall pattern.
Figure 3
Figure 3. Two representative R2PO revision episodes. Example 1 shows a conservative one-parameter revision.
Figure 4
Figure 4. Learning curves (mean ± standard deviation over 10 independent runs) for R2PO, ProPS, ProPS+, and the best SB3 baseline across all ten environments.
Original abstract

Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search in compact policy classes. A Search-LLM generates candidate parameters; the environment produces rollouts; a Critic-LLM then inspects full trajectories (states, actions, rewards) to propose targeted revisions. The work identifies 'salience bias'—LLM fixation on single failures amid mostly successful rollouts—and proposes mitigations via aggregate statistics, median selection, and revision rules. Ablations across ten environments confirm that separating global search from revision and filtering high-variance edits are necessary. Using a 20B open-weight model, R2PO reports the highest mean best reward, substantially faster convergence (e.g., near-max CartPole reward in ~500 episodes), and greater stability than deep RL and prior LLM-based optimizers.

Significance. If the empirical claims hold, the work provides concrete evidence that supplying full trajectories as in-context evidence, rather than reducing them to scalars, materially improves LLM-driven policy optimization. The identification and mitigation of salience bias offers a useful diagnostic for LLM limitations in sequential decision tasks. Strengths include the use of an open-weight 20B model, systematic ablations demonstrating component necessity, and a falsifiable failure-mode analysis. These elements could influence hybrid LLM-RL designs by showing that behavior-grounded prompting enables smaller models to learn faster and more stably than scalar-reward baselines.

major comments (3)
  1. [Ablation studies] Ablation studies (described in the experiments section): while the paper shows that removing the Critic-LLM or trajectory input degrades performance, there is no controlled comparison in which the Critic-LLM receives only scalar rewards or aggregate statistics versus full trajectories. Without this isolation, it remains possible that gains arise from the outer selection loop or extra LLM calls rather than from trajectory-grounded revisions specifically.
  2. [Salience Bias Analysis] Salience-bias analysis (CartPole three-trajectory variant): the claim that salience bias explains 76.6% of regressions is presented without the absolute count of regressions, variance across seeds, or a direct comparison to a scalar-reward prompting condition. This weakens the assertion that the proposed aggregate-statistics and median-selection rules are the primary mitigators rather than incidental effects of the overall framework.
  3. [Experimental Results] Experimental results (headline performance claims): the statements that R2PO achieves the highest mean best reward across all ten environments and reaches near-optimal performance substantially earlier lack reported standard errors, number of independent runs, and statistical significance tests against baselines. These details are load-bearing for the stability and superiority conclusions.
minor comments (3)
  1. [Introduction / Method] The abstract and method description use 'compact policy classes' without an explicit definition or example parameterization in the main text; a short clarifying paragraph or table would help readers replicate the search space.
  2. [Figures] Figure captions for learning curves should include the exact number of episodes shown and whether shaded regions represent standard error or min/max across runs.
  3. [Related Work] The paper cites prior LLM-based policy optimizers but does not discuss how R2PO's two-stage prompting differs from chain-of-thought or self-refinement techniques in the broader LLM literature; a brief related-work paragraph would strengthen positioning.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical isolation of trajectory-grounded revisions, the salience-bias analysis, and the statistical reporting of results. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: Ablation studies (described in the experiments section): while the paper shows that removing the Critic-LLM or trajectory input degrades performance, there is no controlled comparison in which the Critic-LLM receives only scalar rewards or aggregate statistics versus full trajectories. Without this isolation, it remains possible that gains arise from the outer selection loop or extra LLM calls rather than from trajectory-grounded revisions specifically.

    Authors: We agree this isolation would more precisely attribute gains to trajectory input. The existing ablations remove the Critic-LLM entirely or replace trajectory input with no behavioral evidence, but do not hold the Critic-LLM fixed while varying only scalar versus full-trajectory prompting. We will add this controlled ablation in the revised experiments section, prompting the Critic-LLM with scalar rewards plus aggregates versus full state-action-reward sequences while keeping the Search-LLM and selection rules identical. revision: yes

  2. Referee: Salience-bias analysis (CartPole three-trajectory variant): the claim that salience bias explains 76.6% of regressions is presented without the absolute count of regressions, variance across seeds, or a direct comparison to a scalar-reward prompting condition. This weakens the assertion that the proposed aggregate-statistics and median-selection rules are the primary mitigators rather than incidental effects of the overall framework.

    Authors: The 76.6% figure was obtained by inspecting all regression cases in the three-trajectory CartPole runs and counting those where the Critic-LLM revised based solely on the worst trajectory. We will report the absolute counts (e.g., 23 of 30 regressions) and variance across the five seeds in the revised manuscript. A direct scalar-reward prompting condition for the Critic-LLM was not run because the framework is defined around multi-trajectory input; however, the overall ablation removing trajectory input already shows performance degradation, and we will add a brief discussion clarifying that scalar prompting sidesteps salience bias by construction but forgoes the diagnostic information that enables targeted revisions. revision: partial

  3. Referee: Experimental results (headline performance claims): the statements that R2PO achieves the highest mean best reward across all ten environments and reaches near-optimal performance substantially earlier lack reported standard errors, number of independent runs, and statistical significance tests against baselines. These details are load-bearing for the stability and superiority conclusions.

    Authors: We will update all result tables and figures to report the number of independent runs (five seeds per environment), include standard-error bars on learning curves, and add paired t-test p-values for the mean-best-reward and convergence-time comparisons against each baseline. These statistics were computed from the existing runs and will be included in the revision; a minimal sketch of this aggregation follows below. revision: yes
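The aggregation the authors promise could look roughly like the sketch below, assuming one best-reward value per seed and using scipy.stats.ttest_rel for the paired comparison; this illustrates the reporting format only, not the authors' analysis code.

```python
import numpy as np
from scipy import stats

def summarize_comparison(rewards_r2po, rewards_baseline):
    """Mean best reward with standard error over seeds, plus a paired t-test
    against a baseline run on the same seeds. Inputs are assumed to be 1-D
    arrays with one best-reward entry per independent run (illustrative)."""
    rewards_r2po = np.asarray(rewards_r2po, dtype=float)
    rewards_baseline = np.asarray(rewards_baseline, dtype=float)

    mean = rewards_r2po.mean()
    # Standard error across seeds, the quantity promised for the revised figures.
    stderr = rewards_r2po.std(ddof=1) / np.sqrt(len(rewards_r2po))

    # Paired test: both methods are evaluated on the same seeds/environments.
    t_stat, p_value = stats.ttest_rel(rewards_r2po, rewards_baseline)
    return {"mean": mean, "stderr": stderr, "t": t_stat, "p": p_value}
```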

Circularity Check

0 steps flagged

No significant circularity: purely empirical method with no derivation chain

full rationale

The paper presents R2PO as an empirical LLM-based policy optimization framework relying on trajectory inspection by a Critic-LLM, ablations across ten environments, and direct comparisons to baselines. No mathematical derivations, equations, or first-principles results are claimed or present that could reduce outputs to inputs by construction. Claims of superior mean best reward, faster convergence, and stability are supported by experimental outcomes rather than analytical self-reference. Any self-citations (if present) are not load-bearing for the core empirical results, which are externally falsifiable via replication on the stated environments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The proposal relies on standard LLM prompting capabilities and existing RL environment execution without introducing new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5643 in / 1359 out tokens · 82467 ms · 2026-05-12T00:47:12.737081+00:00 · methodology

