pith. machine review for the scientific record.

arxiv: 2605.08315 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM policy optimization · trajectory feedback · salience bias · reflective prompting · reinforcement learning · policy search · agent trajectories · multi-environment evaluation

The pith

A Critic-LLM that inspects full agent trajectories proposes targeted revisions that let policies reach higher rewards faster and more stably than scalar-reward baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing LLM-based policy optimizers receive only a scalar reward and therefore cannot distinguish whether an agent looped, fell into a hole, or failed on one rollout out of twenty. R2PO adds a second stage in which a Critic-LLM receives the actual sequences of states, actions, and rewards and outputs concrete parameter changes grounded in those observations. Across ten environments this two-stage process produces the highest mean best reward and reaches near-maximum performance in substantially fewer episodes than either deep RL or earlier LLM methods. The work also isolates salience bias, the tendency of the Critic-LLM to over-focus on a single poor trajectory even when most succeed, and shows that aggregate statistics plus median selection largely eliminate the resulting regressions.

Core claim

R2PO is a two-stage framework in which a Search-LLM proposes candidate policy parameters, the environment executes them to generate rollouts, and a Critic-LLM examines those rollouts to propose targeted revisions grounded in observed states, actions, and rewards. Ablations establish that separating global search from behavior-grounded revision and filtering high-variance edits are both necessary for the observed gains. The framework identifies salience bias as a dominant failure mode in which the Critic-LLM fixates on improving a single failure even when most trajectories succeed; this accounts for 76.6 percent of regressions in a three-trajectory CartPole variant. With a 20B open-weight LLM, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (near-maximum CartPole reward within roughly 500 episodes), and trains more stably than both deep RL and prior LLM-based methods.
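To make the control flow concrete, here is a minimal Python sketch of one R2PO iteration. The helper names search_llm_propose, critic_llm_revise, and evaluate are hypothetical placeholders for the paper's Search-LLM, Critic-LLM, and environment rollout machinery; the sketch illustrates the structure described above, not the authors' implementation.

```python
# Minimal sketch of one R2PO iteration (illustrative only).
# search_llm_propose, critic_llm_revise, and evaluate are hypothetical
# placeholders for the Search-LLM, Critic-LLM, and environment rollouts.

def r2po_step(history, env, num_rollouts=20):
    # Stage 1: the Search-LLM proposes candidate parameters from the
    # reward-only replay history of (theta, mean_reward) pairs.
    theta = search_llm_propose(history)
    mean_reward, trajectories = evaluate(env, theta, num_rollouts)

    # Stage 2: the Critic-LLM inspects full trajectories (states, actions,
    # rewards) and proposes a targeted revision grounded in that evidence.
    theta_revised = critic_llm_revise(theta, trajectories)
    revised_reward, _ = evaluate(env, theta_revised, num_rollouts)

    # Selection filters harmful or high-variance edits: keep the revision
    # only if it does not regress relative to the original proposal.
    if revised_reward >= mean_reward:
        theta, mean_reward = theta_revised, revised_reward

    history.append((theta, mean_reward))
    return theta, mean_reward
```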

What carries the argument

The Critic-LLM that receives full execution trajectories and generates behavior-grounded policy revisions, together with aggregate rollout statistics and median-trajectory selection to counteract salience bias.
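A rough sketch of how such trajectory evidence might be assembled is given below, assuming each rollout is a list of (state, action, reward) steps. The aggregate statistics and median-trajectory choice follow the mitigation described above; the paper's actual prompt construction is not reproduced here.

```python
import statistics

def build_trajectory_evidence(rollouts):
    """Summarize K rollouts into aggregate statistics plus one representative
    (median-return) trajectory. Assumes each rollout is a list of
    (state, action, reward) steps; illustrative, not the paper's code."""
    returns = [sum(reward for _, _, reward in rollout) for rollout in rollouts]
    lengths = [len(rollout) for rollout in rollouts]

    # Aggregate statistics let the Critic-LLM judge whether failures are
    # systematic or occasional, rather than fixating on the worst rollout.
    stats = {
        "mean_return": statistics.mean(returns),
        "std_return": statistics.pstdev(returns),
        "min_return": min(returns),
        "max_return": max(returns),
        "mean_length": statistics.mean(lengths),
    }

    # Median-trajectory selection: hand the Critic-LLM the typical rollout
    # instead of an outlier failure.
    median_index = sorted(range(len(returns)), key=returns.__getitem__)[len(returns) // 2]
    return stats, rollouts[median_index]
```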

If this is right

  • Separating policy proposal from trajectory-based critique is required to obtain reliable gains over scalar-reward LLM methods.
  • Using aggregate statistics and median selection reduces performance regressions caused by salience bias.
  • Near-optimal performance is reached in far fewer episodes when behavioral evidence is supplied instead of compressed rewards.
  • Training stability improves when high-variance edits are filtered before the policy is updated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The salience-bias finding indicates that multi-example prompting in variable-outcome settings requires explicit mechanisms to avoid outlier fixation.
  • Trajectory-grounded revision may extend to other sequential decision tasks where execution traces are available as in-context evidence.
  • The performance gap over scalar methods suggests that in-context behavioral feedback can serve as a lightweight alternative to gradient updates for certain policy classes.

Load-bearing premise

The Critic-LLM can produce targeted revisions from trajectories that reliably improve the policy rather than introducing new errors or high-variance changes.

What would settle it

An ablation in which the Critic-LLM receives only scalar rewards and no trajectory details fails to produce faster convergence or higher final rewards than scalar-only baselines on the same ten environments.
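One way such an ablation could be wired up, sketched with the same hypothetical helpers as above; critic_llm_revise_scalar stands in for a critic prompt that contains only reward summaries, while the Search-LLM and selection rule are held identical across conditions.

```python
# Sketch of the settling ablation: identical search and selection, two critic
# conditions. All helper functions are hypothetical placeholders; the scalar
# variant sees only reward summaries, never full trajectories.

def run_condition(env, num_episodes, use_trajectories, num_rollouts=20):
    history, best = [], float("-inf")
    for _ in range(num_episodes):
        theta = search_llm_propose(history)
        mean_reward, trajectories = evaluate(env, theta, num_rollouts)

        if use_trajectories:
            theta_rev = critic_llm_revise(theta, trajectories)        # full evidence
        else:
            theta_rev = critic_llm_revise_scalar(theta, mean_reward)  # scalar only

        rev_reward, _ = evaluate(env, theta_rev, num_rollouts)
        if rev_reward >= mean_reward:
            theta, mean_reward = theta_rev, rev_reward

        history.append((theta, mean_reward))
        best = max(best, mean_reward)
    return best

# If the scalar-only condition matches the full-trajectory condition on final
# reward and convergence speed, the trajectory-grounding claim would not hold.
```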

Figures

Figures reproduced from arXiv: 2605.08315 by Claudio Zito, Rahaf Abu Hara, Vaibbhav Murarri.

Figure 1
Figure 1. The two-stage R2PO framework. The Search-LLM proposes candidate parameters θ_init conditioned on a reward-only replay history {(θ_i, R̄_i)}_{i=1}^{t−1} of previously evaluated parameters and their mean rewards. The environment evaluates θ_init over K rollouts, returning the mean reward and rollout trajectories from which Trajectory Evidence (orange, dashed) is constructed. The Critic-LLM uses this evidence to propose a targeted, behavior-grounded revision.
Figure 2
Figure 2. Learning curves (mean ± standard deviation over 10 independent runs) for R2PO, ProPS, ProPS+, and the best SB3 baseline on four representative environments. R2PO reaches strong performance earlier and maintains it more consistently than other baselines. Curves for the remaining six environments are in Appendix G and show the same overall pattern.
Figure 3
Figure 3. Two representative R2PO revision episodes. Example 1 shows a conservative one-parameter revision.
Figure 4
Figure 4. Learning curves (mean ± standard deviation over 10 independent runs) for R2PO, ProPS, ProPS+, and the best SB3 baseline across all ten environments.
Original abstract

Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search in compact policy classes. A Search-LLM generates candidate parameters; the environment produces rollouts; a Critic-LLM then inspects full trajectories (states, actions, rewards) to propose targeted revisions. The work identifies 'salience bias'—LLM fixation on single failures amid mostly successful rollouts—and proposes mitigations via aggregate statistics, median selection, and revision rules. Ablations across ten environments confirm that separating global search from revision and filtering high-variance edits are necessary. Using a 20B open-weight model, R2PO reports the highest mean best reward, substantially faster convergence (e.g., near-max CartPole reward in ~500 episodes), and greater stability than deep RL and prior LLM-based optimizers.

Significance. If the empirical claims hold, the work provides concrete evidence that supplying full trajectories as in-context evidence, rather than reducing them to scalars, materially improves LLM-driven policy optimization. The identification and mitigation of salience bias offers a useful diagnostic for LLM limitations in sequential decision tasks. Strengths include the use of an open-weight 20B model, systematic ablations demonstrating component necessity, and a falsifiable failure-mode analysis. These elements could influence hybrid LLM-RL designs by showing that behavior-grounded prompting enables smaller models to learn faster and more stably than scalar-reward baselines.

major comments (3)
  1. [Ablation studies] Ablation studies (described in the experiments section): while the paper shows that removing the Critic-LLM or trajectory input degrades performance, there is no controlled comparison in which the Critic-LLM receives only scalar rewards or aggregate statistics versus full trajectories. Without this isolation, it remains possible that gains arise from the outer selection loop or extra LLM calls rather than from trajectory-grounded revisions specifically.
  2. [Salience Bias Analysis] Salience-bias analysis (CartPole three-trajectory variant): the claim that salience bias explains 76.6% of regressions is presented without the absolute count of regressions, variance across seeds, or a direct comparison to a scalar-reward prompting condition. This weakens the assertion that the proposed aggregate-statistics and median-selection rules are the primary mitigators rather than incidental effects of the overall framework.
  3. [Experimental Results] Experimental results (headline performance claims): the statements that R2PO achieves the highest mean best reward across all ten environments and reaches near-optimal performance substantially earlier lack reported standard errors, number of independent runs, and statistical significance tests against baselines. These details are load-bearing for the stability and superiority conclusions.
minor comments (3)
  1. [Introduction / Method] The abstract and method description use 'compact policy classes' without an explicit definition or example parameterization in the main text; a short clarifying paragraph or table would help readers replicate the search space.
  2. [Figures] Figure captions for learning curves should include the exact number of episodes shown and whether shaded regions represent standard error or min/max across runs.
  3. [Related Work] The paper cites prior LLM-based policy optimizers but does not discuss how R2PO's two-stage prompting differs from chain-of-thought or self-refinement techniques in the broader LLM literature; a brief related-work paragraph would strengthen positioning.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical isolation of trajectory-grounded revisions, the salience-bias analysis, and the statistical reporting of results. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: Ablation studies (described in the experiments section): while the paper shows that removing the Critic-LLM or trajectory input degrades performance, there is no controlled comparison in which the Critic-LLM receives only scalar rewards or aggregate statistics versus full trajectories. Without this isolation, it remains possible that gains arise from the outer selection loop or extra LLM calls rather than from trajectory-grounded revisions specifically.

    Authors: We agree this isolation would more precisely attribute gains to trajectory input. The existing ablations remove the Critic-LLM entirely or replace trajectory input with no behavioral evidence, but do not hold the Critic-LLM fixed while varying only scalar versus full-trajectory prompting. We will add this controlled ablation in the revised experiments section, prompting the Critic-LLM with scalar rewards plus aggregates versus full state-action-reward sequences while keeping the Search-LLM and selection rules identical. revision: yes

  2. Referee: Salience-bias analysis (CartPole three-trajectory variant): the claim that salience bias explains 76.6% of regressions is presented without the absolute count of regressions, variance across seeds, or a direct comparison to a scalar-reward prompting condition. This weakens the assertion that the proposed aggregate-statistics and median-selection rules are the primary mitigators rather than incidental effects of the overall framework.

    Authors: The 76.6% figure was obtained by inspecting all regression cases in the three-trajectory CartPole runs and counting those where the Critic-LLM revised based solely on the worst trajectory. We will report the absolute counts (e.g., 23 of 30 regressions) and variance across the five seeds in the revised manuscript. A direct scalar-reward prompting condition for the Critic-LLM was not run because the framework is defined around multi-trajectory input; however, the overall ablation removing trajectory input already shows performance degradation, and we will add a brief discussion clarifying that scalar prompting sidesteps salience bias by construction but forgoes the diagnostic information that enables targeted revisions. revision: partial

  3. Referee: Experimental results (headline performance claims): the statements that R2PO achieves the highest mean best reward across all ten environments and reaches near-optimal performance substantially earlier lack reported standard errors, number of independent runs, and statistical significance tests against baselines. These details are load-bearing for the stability and superiority conclusions.

    Authors: We will update all result tables and figures to report the number of independent runs (five seeds per environment), include standard-error bars on learning curves, and add paired t-test p-values for the mean-best-reward and convergence-time comparisons against each baseline. These statistics were computed from the existing runs and will be included in the revision; a minimal sketch of this aggregation follows below. revision: yes
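The aggregation the authors promise could look roughly like the sketch below, assuming one best-reward value per seed and using scipy.stats.ttest_rel for the paired comparison; this illustrates the reporting format only, not the authors' analysis code.

```python
import numpy as np
from scipy import stats

def summarize_comparison(rewards_r2po, rewards_baseline):
    """Mean best reward with standard error over seeds, plus a paired t-test
    against a baseline run on the same seeds. Inputs are assumed to be 1-D
    arrays with one best-reward entry per independent run (illustrative)."""
    rewards_r2po = np.asarray(rewards_r2po, dtype=float)
    rewards_baseline = np.asarray(rewards_baseline, dtype=float)

    mean = rewards_r2po.mean()
    # Standard error across seeds, the quantity promised for the revised figures.
    stderr = rewards_r2po.std(ddof=1) / np.sqrt(len(rewards_r2po))

    # Paired test: both methods are evaluated on the same seeds/environments.
    t_stat, p_value = stats.ttest_rel(rewards_r2po, rewards_baseline)
    return {"mean": mean, "stderr": stderr, "t": t_stat, "p": p_value}
```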

Circularity Check

0 steps flagged

No significant circularity: purely empirical method with no derivation chain

full rationale

The paper presents R2PO as an empirical LLM-based policy optimization framework relying on trajectory inspection by a Critic-LLM, ablations across ten environments, and direct comparisons to baselines. No mathematical derivations, equations, or first-principles results are claimed or present that could reduce outputs to inputs by construction. Claims of superior mean best reward, faster convergence, and stability are supported by experimental outcomes rather than analytical self-reference. Any self-citations (if present) are not load-bearing for the core empirical results, which are externally falsifiable via replication on the stated environments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The proposal relies on standard LLM prompting capabilities and existing RL environment execution without introducing new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5643 in / 1359 out tokens · 82467 ms · 2026-05-12T00:47:12.737081+00:00 · methodology

